Tuesday, August 25, 2009

Compression, encryption and deduplication

When you are doing backups or archival, compression and encryption are a must. Compression is necessary to save precious bandwidth, while encryption is necessary when you store your data outside the boundaries of your data center/home. Now, considering that you are archiving data for long periods, you would also want to deduplicate it to reduce the amount of storage used. But encryption and compression are quite incompatible with deduplication. Deduplication tries to find identical blocks in the data set so that those blocks can be shared and storage space can be saved. But even a small change in your data can cause the encryption/compression algorithms to produce very different outputs, fooling the poor little deduplication engine. Deduplication loves patterns in the data, while good encryption algorithms try hard to remove patterns from the data for better security.
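
To see the effect for yourself, here is a small experiment (plain Python, with zlib standing in for whatever compressor your backup tool actually uses) that compares 4 KB blocks of two nearly identical inputs before and after compression. A single changed byte leaves all but one plaintext block intact, but usually almost none of the compressed blocks still match, and encryption behaves the same way or worse.

import zlib

def blocks(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"".join(b"record %06d: some text payload\n" % i for i in range(50000))
edited = b"X" + original[1:]   # flip a single byte near the start

plain_same = sum(a == b for a, b in zip(blocks(original), blocks(edited)))
comp_a, comp_b = zlib.compress(original), zlib.compress(edited)
comp_same = sum(a == b for a, b in zip(blocks(comp_a), blocks(comp_b)))

print("identical plaintext blocks:  %d of %d" % (plain_same, len(blocks(original))))
print("identical compressed blocks: %d of %d" % (comp_same, len(blocks(comp_a))))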

Some googling led me to rsyncrypto and rsyncable-gzip, which modify encryption and compression respectively to be rsync-friendly. rsync has an excellent algorithm with which only the changed parts of the data need to be sent over the network when syncing data sets.
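
For the curious, the trick that makes this possible is a weak rolling checksum (roughly the Adler-32-style sum described in the rsync paper) that can be slid forward one byte at a time in constant time, so matching blocks can be searched for at every byte offset. Below is a toy version just to show the rolling property; treat it as a sketch, not the exact checksum rsync ships with.

MOD = 1 << 16

def weak_checksum(block):
    a = sum(block) % MOD
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % MOD
    return (b << 16) | a

def roll(checksum, old_byte, new_byte, block_len):
    # Slide the window one byte: drop old_byte from the front, append new_byte.
    a = checksum & 0xFFFF
    b = checksum >> 16
    a = (a - old_byte + new_byte) % MOD
    b = (b - block_len * old_byte + a) % MOD
    return (b << 16) | a

data = b"the quick brown fox jumps over the lazy dog"
n = 16
# The checksum of data[1:17] computed directly must equal the checksum of
# data[0:16] rolled forward by one byte.
direct = weak_checksum(data[1:1 + n])
rolled = roll(weak_checksum(data[:n]), data[0], data[n], n)
print(direct == rolled)   # True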

rsyncable-gzip is a patch to gzip which causes the compression to be done in chunks rather than processing the entire file in one go. This localizes changes within the compressed output, allowing rsync to do a much better job. It can lead to slightly lower compression ratios in some cases.
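
A rough sketch of the same idea with zlib is below: flush the compressor at chunk boundaries so an edit can only disturb the chunk it falls in. As far as I understand the actual patch, it derives the boundaries from a rolling hash of the uncompressed data so that they survive insertions and deletions; the fixed-size chunks here are only to keep the sketch short.

import zlib

def compress_rsyncable_ish(data, chunk=8192):
    out = []
    comp = zlib.compressobj()
    for i in range(0, len(data), chunk):
        out.append(comp.compress(data[i:i + chunk]))
        # A full flush resets the compressor state, so later chunks never
        # reference data from earlier chunks.
        out.append(comp.flush(zlib.Z_FULL_FLUSH))
    out.append(comp.flush())
    return b"".join(out)

The result is still a normal zlib stream (zlib.decompress handles it unchanged); the price is the slightly lower compression ratio mentioned above.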

rsyncrypto modifies the standard encryption scheme by localizing the effects of encryption, so that a small change in the plaintext causes only a small change in the ciphertext. This again allows rsync to work much more efficiently. The modification does trade away some of the cipher's strength, but it is still good enough for most use cases.
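
As far as I understand it, rsyncrypto keeps ordinary AES-CBC but restarts the chain from the per-file IV whenever a rolling function of the plaintext hits a trigger value, so a local edit only disturbs the ciphertext up to the next trigger point. Here is a rough sketch of that idea using the third-party cryptography package; the chunking trigger and parameters are made up for illustration, and this is not rsyncrypto's actual file format.

import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def cut_points(data, min_chunk=512, window=32, target=2048):
    # Content-defined boundaries: a cheap rolling sum of the last `window` bytes.
    cuts, last, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        if i - last >= min_chunk and rolling % target == 0:
            cuts.append(i + 1)
            last = i + 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts

def encrypt_rsyncrypto_ish(data, key, iv):
    out, start = [], 0
    for cut in cut_points(data):
        chunk, start = data[start:cut], cut
        # Pad each chunk on its own and restart CBC from the per-file IV, so a
        # chunk's ciphertext depends only on that chunk's plaintext.
        padder = padding.PKCS7(128).padder()
        padded = padder.update(chunk) + padder.finalize()
        enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        out.append(enc.update(padded) + enc.finalize())
    return b"".join(out)

key, iv = os.urandom(32), os.urandom(16)
ciphertext = encrypt_rsyncrypto_ish(b"some archive data" * 1000, key, iv)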

Now obviously this problem has an easier solution: deduplicate first, then compress, then encrypt. But this flow may not always be possible.
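
When that ordering is available, the whole pipeline fits in a few lines: chunk the data, look each chunk up by a hash of its plaintext, and only compress and encrypt chunks the store has not seen before. A minimal sketch follows; the fixed-size chunks and the names are just for illustration, not any particular product's design.

import hashlib
import zlib

CHUNK = 4096

def store_backup(data, store, encrypt=lambda blob: blob):
    # store   : dict mapping plaintext-chunk hash -> compressed+encrypted blob
    # encrypt : plug a real cipher in here; the default is only a placeholder
    # Returns the recipe (ordered list of hashes) needed to rebuild this backup.
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # Dedup happens on the plaintext, so compression and encryption
            # never get a chance to hide the duplicates.
            store[digest] = encrypt(zlib.compress(chunk))
        recipe.append(digest)
    return recipe

Two backups of nearly identical data end up sharing almost every entry in the store; only the small recipes differ.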

3 comments:

Anand said...

The sequence would always be like this: dedup first, then compress, and then encrypt. Since compression and dedup do somewhat similar things (in essence, not in implementation), compression would always mess up dedup. Hence compression and dedup should work independently of each other. Last, encryption is needed for better security.

Unknown said...

In this case, dedup first is not possible because not all of the data is available on-premises; it's in the cloud. And the data in the cloud must be encrypted and is generally compressed.

Anand said...

If the data is distributed, dedup takes on a whole new dimension. A typical distributed file system has constituent nodes of equal priority, and to dedup such a file system, all nodes need to communicate with each other to retrieve/update the dedup metadata (typically the dedup database).
In cloud computing scenarios, the cloud is the data store while the nodes may cache data in order to use it. If you are a cloud user, you would simply need a decryption and decompression engine, while dedup happens transparently in the cloud.
