Tuesday, August 25, 2009

Compression, encryption and deduplication

When you are doing backups or archival, compression and encryption are a must. Compression saves precious bandwidth, while encryption is necessary when you store your data outside the boundaries of your data center/home. And since you are archiving data for long periods, you would also want to deduplicate it to reduce the amount of storage used. But encryption and compression are quite incompatible with deduplication. Deduplication tries to find identical blocks in the data set so that those blocks can be shared and storage space saved. Even a small change in your data can cause the encryption/compression algorithms to produce very different output, fooling the poor little deduplication engine. Deduplication loves patterns in the data, while good encryption algorithms try hard to remove patterns from the data for better security.

Some googling led me to rsyncrypto and rsyncable gzip, where compression and encryption are modified to be rsync-friendly. rsync has an excellent algorithm that sends only the changed parts of the data over the network when syncing data sets.

rsyncable gzip is a patch to gzip which causes the compression to be done in chunks rather than processing the entire file in one go. This localizes changes within the compressed output, allowing rsync to do a better job. It can lead to slightly lower compression ratios in some cases.
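As a rough sketch of how this fits into a backup flow, assuming a gzip build that carries the rsyncable patch (it was a Debian patch at the time, so a plain upstream gzip may not have the flag):

# compress in rsync-friendly chunks so a small change in the source
# only perturbs the nearby compressed blocks
tar cf backup.tar /home/me/data
gzip --rsyncable backup.tar            # produces backup.tar.gz

# on subsequent runs rsync only ships the chunks that actually changed
rsync -av backup.tar.gz backuphost:/archive/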

rsyncrypto modifies the standard encryption scheme by localizing the effects of a change so that the side effects in the ciphertext stay small. This again allows rsync to work much more efficiently. It may weaken the encryption somewhat, but it is still good enough for most use cases.
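rsyncrypto's basic invocation takes a plaintext file, a ciphertext destination, a per-file symmetric key file and a public-key certificate; the file names below are just illustrative, so check the man page for the exact options:

# encrypt the compressed archive; the symmetric key ends up in backup.key,
# itself protected by the backup.crt public key
rsyncrypto backup.tar.gz backup.tar.gz.enc backup.key backup.crt

# changes in the plaintext stay localized in the ciphertext,
# so rsync can again transfer only the modified parts
rsync -av backup.tar.gz.enc backuphost:/archive/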

Now obviously this problem has an easier solution: deduplicate first, then compress, then encrypt. But this flow may not always be possible.

Wednesday, August 5, 2009

Rackspace CloudFiles

CloudFiles is a cloud storage offering from Rackspace. You can use it for archiving and backing up your data over the web. The data is replicated across three locations, giving you excellent redundancy.

I tried out their offering and here are some of my notes:

- Max file size of 5GB; most cloud storage vendors have such limits, mostly due to protocol limitations.
- Along with CloudFiles they provide a Content Distribution Network (CDN) for distributing data across their data centers around the world.
- Storage costs 15 cents/GB/month; bandwidth is 8 cents/GB inbound and 22 cents/GB outbound.
- APIs in multiple languages - PHP, Java, C#, Python, Ruby
- They have a "browser panel" which can be used to upload files and distribute them over the CDN.
- A Mozilla extension and a Mac app to access CloudFiles storage are available. They are not maintained by Rackspace and are *not* backup tools, just front-ends.
- Provides SSL for security
- Tokens expire every 24 hours, after which clients need to re-authenticate. Every operation must send an authentication token (see the cURL sketch after this list).
- FireUploader simulates a hierarchical file system structure, which CloudFiles itself does not provide.
- Cyberduck on the Mac is GPL software and a good front-end for CloudFiles.
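For completeness, this is roughly what authentication looks like with cURL. The endpoint and header names below are the ones from the developer guide as I remember them, so treat this as a sketch rather than gospel:

# authenticate; the response headers contain X-Auth-Token and X-Storage-Url
curl -i -H "X-Auth-User: myusername" -H "X-Auth-Key: myapikey" \
    https://api.mosso.com/auth

# every later request passes the token back
curl -i -H "X-Auth-Token: <token from above>" "<storage URL from above>"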

- The only way a user can access content from another account is if the owner shares their username/API access key or a session token.
- Files are called "objects" in Rackspace lingo. Data is saved as-is, without compression or encryption. Objects do support metadata in the form of key-value pairs, so you can "tag" files and organize your data (see the metadata sketch after this list). Metadata is limited to 4KB, and a maximum of 90 individual tags can be stored per object.
- You can enable CDN on selected containers. Each CDN-enabled container has a unique Uniform Resource Locator (URL). For example, a CDN-enabled container named "photos" can be referenced as http://cdn.cloudfiles.mosso.com/c3131 - if this container has an image named "baby.jpg", that image can be served through the LimeLight Networks CDN at http://cdn.cloudfiles.mosso.com/c3131/baby.jpg. This is how SaaS applications can serve data stored in our customers' accounts.
- The API documentation is good, and code snippets/examples are available.
- The APIs are RESTful, i.e. they follow the Representational State Transfer style over HTTP.
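A quick sketch of the key-value metadata in action; the X-Object-Meta-* header convention is how I understood the docs, and the token/URL values are placeholders:

# upload an object and tag it in one go
curl -X PUT -T vacation.jpg \
    -H "X-Auth-Token: <token>" \
    -H "X-Object-Meta-Trip: hawaii-2009" \
    "<storage URL>/photos/vacation.jpg"

# read the tags back with a HEAD request
curl -I -H "X-Auth-Token: <token>" "<storage URL>/photos/vacation.jpg"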

- You need to do a GET call on the account to get a list of all containers in it. The listing is also available in JSON or XML format.
- A maximum of 10,000 container names is returned at a time. You can make a continuation call with a "marker" parameter to retrieve the remaining containers.
- A HEAD call on the account returns the number of containers and the number of bytes used.
- A GET operation on a container lists the objects in that container.
- Pseudo-hierarchical folders/directories:
- You can simulate a hierarchical structure in CloudFiles by following a few guidelines: use the forward slash character '/' in object names as a path element separator, create "directory marker" objects for the directories, and then traverse the nested structure with the "path" query parameter. This is best illustrated by example. The container where the objects reside is called "backups", and all objects in the example start with a prefix of "photos", which should not be confused with the container name.
In this example, the following "real" objects are uploaded to the storage system with names representing their full filesystem path.

photos/animals/dogs/poodle.jpg
photos/animals/dogs/terrier.jpg
photos/animals/cats/persian.jpg
photos/animals/cats/siamese.jpg
photos/plants/fern.jpg
photos/plants/rose.jpg
photos/me.jpg

To take advantage of this feature, "directory marker" objects must also be created to represent the appropriate directories. The following additional objects need to be created; a good convention is to make them zero- or one-byte files with a Content-Type of "application/directory".
photos/animals/dogs
photos/animals/cats
photos/animals
photos/plants
photos

Now, issuing a GET request against the container with the "path" query parameter set to the directory to list traverses these "directories". Only the request line and results are shown below; other request/response headers are omitted.

GET /v1/AccountString/backups?path=photos HTTP/1.1

photos/animals
photos/plants
photos/me.jpg

To traverse down into the "animals" directory, specify that path:

GET /v1/AccountString/backups?path=photos/animals HTTP/1.1

photos/animals/dogs
photos/animals/cats

By combining the "path" query parameter with the "format" query parameter, you can distinguish virtual folders/directories by their Content-Type and build interfaces that allow traversal of the pseudo-nested structure.
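With cURL that request looks roughly like this (the storage URL and token are placeholders):

# list the "photos" pseudo-directory as JSON; directory markers are the
# entries with the application/directory content type
curl -H "X-Auth-Token: <token>" \
    "<storage URL>/backups?path=photos&format=json"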

- A DELETE call on a container will not succeed if the container still has objects in it.
- A HEAD operation on an object retrieves its metadata and standard HTTP headers.
- A GET call retrieves the object's data. It supports conditional headers like If-Match, If-None-Match, If-Modified-Since and If-Unmodified-Since, and it is possible to request a range of bytes as well.
- A PUT call writes or overwrites an object's metadata and content. End-to-end data integrity can be ensured by including an MD5 checksum of your object's data in the ETag header (see the sketch after this list).
- Chunked requests can be sent if you do not know the size of the object you are PUT'ing, but the total size must still be under 5GB.
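A hedged sketch of an upload with that integrity check; the md5sum pipeline is mine and the token/URL are placeholders:

# send the local MD5 as the ETag so the server can verify the upload
MD5=$(md5sum backup.tar.gz.enc | awk '{print $1}')
curl -X PUT -T backup.tar.gz.enc \
    -H "X-Auth-Token: <token>" \
    -H "ETag: $MD5" \
    "<storage URL>/backups/backup.tar.gz.enc"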

Shortcomings mentioned by Rackspace themselves:
- You cannot mount or map a CloudFiles account as a network drive.
- Files cannot be modified in place, so block-level changes are not possible.
- Containers cannot be nested.
- There are no ACLs for security.

cURL is a command-line tool available in most UNIX environments that lets you send and receive HTTP requests and responses from the shell or from within a shell script, so you can work with the ReST API directly instead of going through the client libraries. I used cURL to test out the calls provided by CloudFiles.
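For example, the account-level calls from my notes map onto cURL like this (token and storage URL are placeholders; the X-Account-* response header names are my recollection of the docs):

# GET on the account lists containers; append format=json or format=xml
# for structured output
curl -H "X-Auth-Token: <token>" "<storage URL>"

# HEAD on the account returns the container count and bytes used in the
# X-Account-Container-Count and X-Account-Bytes-Used response headers
curl -I -H "X-Auth-Token: <token>" "<storage URL>"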