Rumble in the Mind: Compression, encryption and deduplication

Tuesday, August 25, 2009

Compression, encryption and deduplication

When you are doing backups or archival, compression and encryption are a must. Compression saves precious bandwidth, and encryption is necessary when you store your data outside the boundaries of your data center or home. And since you are archiving data for long periods, you also want to deduplicate it to reduce the amount of storage. But encryption and compression are quite incompatible with deduplication. Deduplication tries to find identical blocks in the data set so that they can be stored once and shared. But even a small change in your data can cause the encryption or compression algorithm to produce very different output, fooling the poor little deduplication engine. Deduplication loves patterns in the data, while good encryption algorithms try hard to remove patterns from the data for better security.
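
To make the conflict concrete, here is a minimal Python sketch. It assumes a naive dedup engine that hashes fixed 4 KiB blocks; the block size, the helper names and the sample data are all made up for illustration. It overwrites a few bytes near the start of a buffer and then checks how many blocks of the new version the engine could still share with the old one, first on the raw data and then on the zlib-compressed data.

```python
import hashlib
import random
import zlib

BLOCK = 4096  # hypothetical fixed dedup block size


def block_hashes(data: bytes) -> list[bytes]:
    """Hash each fixed-size block, as a naive dedup engine might."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of b's blocks whose hash already appears among a's blocks."""
    known = set(block_hashes(a))
    hashes_b = block_hashes(b)
    return sum(h in known for h in hashes_b) / len(hashes_b)


# Build compressible but non-repetitive sample data (illustrative only).
rng = random.Random(0)
words = [b"backup", b"archive", b"block", b"dedup", b"cloud", b"rsync"]
original = b" ".join(rng.choice(words) for _ in range(200_000))

# Overwrite 8 bytes near the start; the length is unchanged, so raw blocks stay aligned.
modified = original[:100] + b"XXXXXXXX" + original[108:]

raw_shared = shared_fraction(original, modified)
zip_shared = shared_fraction(zlib.compress(original), zlib.compress(modified))
print(f"raw: {raw_shared:.0%} of blocks shared, compressed: {zip_shared:.0%} shared")
```

On the raw data only the one block containing the edit changes. On the compressed data the edit changes the output from the first block onward, so the engine should find far fewer blocks to share, and typically none at all.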

Some googling led me to rsyncrypto and rsyncable-gzip, where the encryption or compression is modified to be rsync friendly. rsync has an excellent algorithm that sends only the changed parts of the data over the network when syncing data sets.

rsyncable-gzip is a patch to gzip that causes the compression to be done in chunks rather than processing the entire file in one go. This localizes changes within the compressed output, allowing rsync to do a better job. It can lead to lower compression ratios in some cases.
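
The chunking idea can be sketched in Python. This is a simplified illustration of the general technique, not the actual rsyncable-gzip patch: it cuts the input wherever a rolling sum over the last few bytes hits a fixed pattern, and compresses every chunk independently. The window size, the mask and the function names are assumptions picked for the example.

```python
import random
import zlib

WINDOW = 32    # hypothetical rolling-window size in bytes
MASK = 0x3FF   # hypothetical: cut when the low 10 bits of the rolling sum are zero


def content_chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling sum of the last WINDOW bytes."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # drop the byte that left the window
        # Cut only if the chunk is at least WINDOW bytes long.
        if i + 1 - start >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def compress_chunked(data: bytes) -> list[bytes]:
    """Compress every chunk independently so that a change stays local."""
    return [zlib.compress(chunk) for chunk in content_chunks(data)]


# Illustrative data: insert 8 bytes near the start and compare the compressed chunks.
rng = random.Random(0)
words = [b"backup", b"archive", b"block", b"dedup", b"cloud", b"rsync"]
original = b" ".join(rng.choice(words) for _ in range(50_000))
modified = original[:1000] + b"INSERTED" + original[1000:]

before = set(compress_chunked(original))
after = compress_chunked(modified)
reused = sum(chunk in before for chunk in after)
print(f"{reused} of {len(after)} compressed chunks unchanged")
```

Because the cut points depend only on nearby content, they line up again shortly after the inserted bytes, and only the compressed chunks around the edit differ. The price is that each chunk is compressed without the context of the ones before it, which is where the lower compression ratio comes from.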

rsyncrypto modifies the standard encryption scheme to localize the effects of a change, so that a small change in the plaintext changes only a small part of the ciphertext. This again allows rsync to work much more efficiently. It may weaken the encryption somewhat, but it will still be good enough for most use cases.

Now obviously this problem has an easier solution: deduplicate first, then compress, then encrypt. But this flow may not always be possible.
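
When all the data is at hand, that ordering can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the fixed 4 KiB blocks, the in-memory dict standing in for the block store, and above all toy_encrypt, which is a placeholder showing where a real cipher would go and is NOT secure.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher: XOR with a SHA-256 counter keystream. NOT secure.

    It reuses the same keystream for every block; a real system would use a
    vetted cipher from a crypto library. XOR is symmetric, so the same
    function also decrypts.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def archive(data: bytes, key: bytes, store: dict[bytes, bytes],
            block: int = 4096) -> list[bytes]:
    """Dedup, then compress, then encrypt. Returns the list of block hashes."""
    recipe = []
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        digest = hashlib.sha256(chunk).digest()
        if digest not in store:                       # 1. deduplicate on the plaintext
            packed = zlib.compress(chunk)             # 2. compress only new blocks
            store[digest] = toy_encrypt(packed, key)  # 3. encrypt last
        recipe.append(digest)
    return recipe


def restore(recipe: list[bytes], key: bytes, store: dict[bytes, bytes]) -> bytes:
    """Undo the pipeline in reverse order: decrypt, decompress, concatenate."""
    return b"".join(zlib.decompress(toy_encrypt(store[d], key)) for d in recipe)


store: dict[bytes, bytes] = {}
data = b"A" * 4096 * 3 + b"B" * 4096   # three identical blocks plus one different
recipe = archive(data, b"secret key", store)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

The dedup step has to see the plaintext blocks, which is exactly why this flow breaks down when the existing data is already compressed and encrypted somewhere else.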

6 comments:

Anand said...: The sequence would always be like this: dedup first, then compress, and then encrypt. Since compression and dedup do somewhat similar things (in essence, not in implementation), compression would always mess up dedup. Hence compression and dedup should work independently of each other. Lastly, encryption is needed for better security.; September 22, 2009 at 11:06 PM
Unknown said...: In this case, dedup first is not possible because not all of the data is available on-premise; it's in the cloud. And the data in the cloud must be encrypted and generally compressed.; September 22, 2009 at 11:26 PM
Anand said...: If the data is distributed, dedup takes on a whole new dimension. A typical distributed file system has constituent nodes of equal priority. To dedup such a file system, all nodes need to communicate with each other to retrieve and update the dedup metadata (typically the dedup database).
In cloud computing scenarios, the cloud is the data store while nodes may cache data in order to use it. If you are a cloud user, you simply need a decryption and decompression engine, while dedup happens transparently in the cloud.; September 23, 2009 at 12:12 AM
Anonymous said...: This comment has been removed by a blog administrator.; December 29, 2009 at 9:41 PM
Anonymous said...: Good brief and this post helped me a lot in my college assignment. Thank you seeking your information.; February 14, 2010 at 5:47 PM
Anonymous said...: To start earning money with your blog, initially use Google Adsense but gradually as your traffic increases, keep adding more and more money making programs to your site.; February 24, 2010 at 5:03 PM
