Friday, July 21, 2006

Amazon S3 Part 1

Two days ago I began looking into Amazon's S3 web service. For those of you who don't know, S3 is described as follows by Amazon:

"Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers."

Hmm...web-scale computing made easy...as a geek there's no way I'm not checking this out, so I went ahead and signed up for an account. The signup process was really easy: I just had to break out my Discover card and they promised to bill me monthly. Excellent deal. Once I was signed in, I had access to all of the documentation and example code. The costs, by the way, are as follows:

Pricing
* Pay only for what you use. There is no minimum fee, and no start-up cost.
* $0.15 per GB-Month of storage used.
* $0.20 per GB of data transferred.

I figure I'll owe them a few pennies each month, no big deal. I really just want to play around and figure out what kind of cool stuff I could build if I had the time.
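To put a number on "a few pennies", here is a rough sketch of the pricing above as a formula, written in Python. The two rates come straight from the list; the usage figures (100 MB stored, 50 MB transferred) are made up for illustration, and the function name is just a placeholder.

```python
from decimal import Decimal

# Prices quoted above (July 2006), in US dollars.
STORAGE_PER_GB_MONTH = Decimal("0.15")
TRANSFER_PER_GB = Decimal("0.20")


def monthly_cost(stored_gb, transferred_gb):
    """Estimated monthly bill in dollars for the given usage in GB."""
    storage = STORAGE_PER_GB_MONTH * Decimal(str(stored_gb))
    transfer = TRANSFER_PER_GB * Decimal(str(transferred_gb))
    return storage + transfer


# Hypothetical usage: 100 MB stored all month, 50 MB transferred.
print(monthly_cost("0.1", "0.05"))
```

With those assumed numbers the bill comes to two and a half cents for the month. Decimal is used instead of floats so the cents add up exactly.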

I downloaded the .NET C# SOAP example, installed it, put in my super secret key, recompiled and kaboom, I was up and running. The example showed me how to work with the core S3 concepts from a command prompt. Here are the technical concepts a developer would need to understand:

1. Buckets

A bucket is simply a container for objects. Each user can have up to 100 buckets. This may sound low, but in reality you only need 1 bucket.

2. Objects

Objects are like files, but they have meta-data around them. Meta-data is data about the objects, stored as key/value pairs. You also have to set up ACLs (Access Control Lists) for each object. You can have an unlimited number of objects. At first glance you may be wondering how to organize all of the objects in a bucket. What is really cool is that you can use any type of delimiter you want to group objects. UNIX people are used to the "/" separator, Windows uses the "\" separator, but you can use whatever fits your application.

3. Keys

Every object has a unique key.

4. Operations

Example operations:

a. Create a bucket
b. Write an object
c. Read an object
d. Delete an object
e. List Keys
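The concepts above can be sketched as a toy in-memory model. To be clear, this is not the S3 API or the SOAP example code: the class name `ToyS3` and every method name are made up for illustration, ACLs are left out, and the only things taken from the description above are the 100-bucket limit, meta-data as key/value pairs, unique keys, and grouping keys by a delimiter of your choice.

```python
class ToyS3:
    """In-memory stand-in for the concepts above; not the real S3 API."""

    MAX_BUCKETS = 100  # per-user limit mentioned above

    def __init__(self):
        self.buckets = {}  # bucket name -> {key: (data, metadata)}

    def create_bucket(self, name):
        if name not in self.buckets and len(self.buckets) >= self.MAX_BUCKETS:
            raise RuntimeError("bucket limit reached")
        self.buckets.setdefault(name, {})

    def put_object(self, bucket, key, data, metadata=None):
        # Keys are unique within a bucket, so a second write replaces the first.
        self.buckets[bucket][key] = (data, dict(metadata or {}))

    def get_object(self, bucket, key):
        return self.buckets[bucket][key]

    def delete_object(self, bucket, key):
        self.buckets[bucket].pop(key, None)

    def list_keys(self, bucket, prefix="", delimiter=None):
        """Return (keys, groups): keys directly under the prefix, plus the
        folder-like groups formed by cutting at the next delimiter."""
        keys, groups = [], set()
        for key in sorted(self.buckets[bucket]):
            if not key.startswith(prefix):
                continue
            rest = key[len(prefix):]
            if delimiter and delimiter in rest:
                groups.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                keys.append(key)
        return keys, sorted(groups)


s3 = ToyS3()
s3.create_bucket("my-bucket")
s3.put_object("my-bucket", "music/song.mp3", b"\x00\x01", {"artist": "me"})
s3.put_object("my-bucket", "music/live/set.mp3", b"\x02")
s3.put_object("my-bucket", "notes.txt", b"hello")

top_keys, top_groups = s3.list_keys("my-bucket", delimiter="/")
music_keys, music_groups = s3.list_keys("my-bucket", prefix="music/", delimiter="/")
```

The point of the last two lines is that "folders" are nothing more than a naming convention on keys: listing with a delimiter groups everything that shares a prefix, and any character would work in place of "/".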

That's it. The concepts are pretty easy to understand, but they're pretty powerful. The example project showed me the basics, but I wanted to build something useful, so I decided to improve "My Internet Based File System" by creating a program that will allow me to:

1. Upload a folder to S3.
2. View a list of objects in my S3 account.
3. Download an object to my local hard disk.
4. Delete objects from S3.

After about an hour I had a working version. The hardest part was fixing their example code to handle binary files as well as text files. Once I got that working, it was just a matter of hammering out the code.

Here is a screenshot of the working version (and yes I do design work on the side):


The main problem with this code is that it reads the entire file into memory before sending the object to S3. That's not a problem with small files, but if you ever wanted to upload a big mp3 or something it wouldn't work. To handle that, you really need to "stream" the object to S3. But doing this via SOAP is rather hard. The basic problem is that SOAP is primarily XML going back and forth. You would need to either dig deep and format the objects yourself (not recommended) or use a concept called "DIME Attachments". Good luck finding example code. A bigger problem for me was that Microsoft switched their "Web Services Enhancements (WSE)" from DIME to MTOM between versions 2.0 and 3.0. I don't really have the time to try to get this mess working. But I have some other ideas I'm going to play with first to see if I can get this working in a simpler manner. Stay tuned.
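For anyone unsure what "streaming" means here, this is a minimal sketch of the idea in Python, not the C# SOAP code and not a working S3 upload. The function names and the 64 KB chunk size are arbitrary choices, and `send` is a hypothetical callback standing in for whatever would actually push bytes over the network.

```python
import io


def iter_chunks(fileobj, chunk_size=64 * 1024):
    """Yield the file's contents one fixed-size chunk at a time."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def upload_streaming(fileobj, send):
    """Pass each chunk to `send` (a hypothetical upload callback)
    and return the total number of bytes handed over."""
    total = 0
    for chunk in iter_chunks(fileobj):
        send(chunk)
        total += len(chunk)
    return total


# Stand-ins for a big file and a network connection.
fake_file = io.BytesIO(b"x" * 200_000)
sent = []
total = upload_streaming(fake_file, sent.append)
```

The reader never holds more than one chunk at a time, so memory use depends on the chunk size rather than the file size. (The `sent` list here keeps every chunk only so the example can be inspected; a real sender would write each chunk out and drop it.)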

If you are wondering how S3 works on the backend, the API docs give some clues. Here is one:

"If the object already exists in the bucket, the new object overwrites the existing object. S3 orders all of the requests that it receives. It is possible that if you send two requests nearly simultaneously, we will receive them in a different order than they were sent. The last request received is the one which is stored in S3. Note that this means if multiple parties are simultaneously writing to the same object, they may all get a successful response even though only one of them wins in the end. This is because S3 is a distributed system and it may take a few seconds for one part of the system to realize that another part has received an object update. In this release of Amazon S3, there is no ability to lock an object for writing -- such functionality, if required, should be provided at the application layer."
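The overwrite rule in that quote can be modeled in a few lines of Python. This is only a toy illustration of "last request received wins", not a description of how S3 is built; the class, the key name, and the two writers are all made up.

```python
class LastWriterWinsStore:
    """Toy model of the overwrite rule quoted above; not how S3 is built."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # Every write is accepted; whichever arrives last is what stays.
        self.objects[key] = data
        return "OK"


store = LastWriterWinsStore()

# Two hypothetical writers target the same key. The store only sees
# arrival order, which may differ from the order they were sent in.
arrivals = [("alice", b"version A"), ("bob", b"version B")]
responses = [store.put("report.txt", data) for _writer, data in arrivals]

final = store.objects["report.txt"]
```

Both writers get a success response, yet only the later arrival survives, which is why any locking has to live in your own application.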