Tuesday, December 06, 2011

Eventual Consistency and Hayloft

Given my thirty-plus years of experience in distributed computing, it's a given that I've had to grapple more or less continuously with the Fallacies of Distributed Computing as laid out by L. Peter Deutsch, James Gosling, and others:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The implications of these myths led Eric Brewer to formulate Brewer's Conjecture, which became the CAP Theorem once it was proven by Seth Gilbert and Nancy Lynch.

A distributed system can guarantee at most two of the following three qualities:

  1. all nodes can see the same data at the same time (consistency);
  2. every request receives a response as to whether it succeeded or failed (availability);
  3. the system continues to operate despite message loss (partition tolerance).

I've been thinking a lot about the implications of this for storage systems ever since attending the 27th IEEE Symposium on Massive Storage Systems and Technologies, a.k.a. MSST 2011. One of my big takeaways from that conference was that distributed storage systems cannot reliably support POSIX semantics, in particular the POSIX features of file locking and consistency in file modification. Thanks to the CAP Theorem, POSIX semantics don't scale.

(I find that all of the really interesting problems are scalability problems, regardless of the problem domain. If solving a problem at scale n seems easy, try applying the same solution at scale n x 1000. Not so much. That's why domains like cloud computing and exascale high performance systems have caught my interest. You can't just take current architectures and multiply by one thousand.)

This is one of the reasons that cloud storage has become popular: the various web-based mechanisms for data transfer used by the cloud, like the RESTful HTTP GET and PUT, and XML-based SOAP, have semantics that have proven to be highly scalable. Hence they map well to large distributed system architectures.

It was this very issue that led Amazon.com to adopt a strategy for its globally distributed Amazon Web Services (AWS), including S3, its non-POSIX-compliant Simple Storage Service, which its CTO Werner Vogels refers to as eventually consistent, choosing availability and partition tolerance over consistency:

Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.

Indeed, the well-publicized failures in AWS have not been in S3, but in its Elastic Block Store (EBS), which tries to support POSIX semantics on top of a globally distributed system.

I can see the Fallacies, and the eventual consistency architecture, at work when I play with Hayloft, my little C++ research project that uses S3. Any realistic application using S3 has to deal with network failures, where simply retrying the operation may be sufficient. But it also has to deal with the issues of consistency convergence: the fact that every S3 action may take place on a different AWS server that doesn't yet know what actions have taken place on other AWS servers, even though those actions may have been performed on the same storage objects.

Hayloft features both a synchronous and an asynchronous interface. The former makes it easy to play with. The latter is more likely to be what a production application would use. The asynchronous interface features a mechanism to block until an S3 action completes, or to execute multiple S3 actions incrementally and in parallel. Unlike in my prior article on Hayloft, I'll use the asynchronous interface here, but with the blocking completion mechanism because it's simpler. (As before, all of this code is taken from working unit tests in the Hayloft distribution, but edited for readability. Apologies for any typos.)

Here is a code snippet that creates a bucket to hold objects. It starts the action and blocks until it completes.

Multiplex multiplex;
BucketCreate bucket("Bucket", multiplex);
bucket.start();
multiplex.complete();

But that's not sufficient. Sometimes (rarely), the action will return a status indicating a temporary failure, like "connection failed", "name lookup error", or "request timed out". The network isn't reliable, and the failure is more likely to be on my end, or somewhere along the convoluted path between my system and S3, than in S3 itself. Simply retrying the action after a one second delay is usually sufficient unless the outage is severe. (Note that "success" is not a retryable condition.)

Multiplex multiplex;
BucketCreate bucket("Bucket", multiplex);
for (int ii = 0; ii < 10; ++ii) {
    bucket.start();
    multiplex.complete();
    if (!bucket.isRetryable()) { break; }
    platform.yield(platform.frequency());
}

Once the bucket is created, we can store an object in it. Here's a code snippet similar to the one above except it has to deal with rewinding the input data source if we need to retry the action. You can see that this might get a smidge complicated depending on where your data is coming from.

PathInput * source = new PathInput("./file.txt");
Size bytes = size(*source);
ObjectPut put("InputFile.txt", bucket, multiplex, source, bytes);
for (int ii = 0; ii < 10; ++ii) {
    put.start();
    multiplex.complete();
    if (!put.isRetryable()) { break; }
    platform.yield(platform.frequency());
    // Rewind for the retry: the prior input was consumed, so open the file again.
    source = new PathInput("./file.txt");
    bytes = size(*source);
    put.reset(source, bytes);
}

But network outages aren't the only failures we have to worry about. The following code snippet retrieves the metadata for the object we just created.

ObjectHead head(put, multiplex);
for (int ii = 0; ii < 10; ++ii) {
    head.start();
    multiplex.complete();
    if (!head.isRetryable()) { break; }
    platform.yield(platform.frequency());
}

This usually works. Except when it doesn't. Sometimes (and much more frequently than I see retries due to temporary network failures) the action will return a non-retryable status indicating the object wasn't found. WTF?

The object head action was serviced by a different AWS server than that of the object put action, one that hadn't yet been notified of the object put. This is eventual consistency in action. The unit test tries a code snippet like the following, treating the non-existence of the object as a retryable error, since it knows darn well the object put was successful.

ObjectHead head(put, multiplex);
for (int ii = 0; ii < 10; ++ii) {
    head.start();
    multiplex.complete();
    if (head.isRetryable()) {
        // Keep trying.
    } else if (head.isNonexistent()) {
        // Keep trying.
    } else {
        break;
    }
    platform.yield(platform.frequency());
}

Eventually this works. But it's even more complicated than that. If you ran this exact same code snippet again, it might still fail at first, because it was run on yet another server which had not yet been updated to be consistent with the others. In fact, its success on a subsequent attempt may be because it just happened to be executed on the original server, not because all the servers had been updated. Applications have to be designed around this. Amazon's S3 best practices document recommends not using code snippets like this to verify the successful put of an object.
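One way to design around it, within that caveat, is to fold the whole retry-until-consistent pattern into a single helper. Here is a sketch only, not part of Hayloft: the helper name and the attempt limit are mine, the Hayloft calls are the same ones used in the snippets above, I'm assuming the action type provides both isRetryable() and isNonexistent() (as ObjectHead does), and I'm assuming the same platform object the unit tests use is in scope.

// Sketch only, not part of Hayloft: retry an action through both
// temporary network failures and not-yet-consistent reads, giving up
// after a fixed number of attempts. The caller still examines the
// action's final status.
template <typename Action>
bool settle(Action & action, Multiplex & multiplex, int limit = 10) {
    for (int ii = 0; ii < limit; ++ii) {
        action.start();
        multiplex.complete();
        if (action.isRetryable()) {
            // Temporary network failure: try again after a delay.
        } else if (action.isNonexistent()) {
            // Served by a not-yet-consistent server: try again.
        } else {
            // Terminal: no longer retryable and not a consistency miss.
            return true;
        }
        // About a one second delay, as in the snippets above.
        platform.yield(platform.frequency());
    }
    return false;
}

With such a helper, the loop above collapses to ObjectHead head(put, multiplex); followed by settle(head, multiplex).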

The unit tests log a message every time they retry. I can actually watch how long it takes for the AWS servers to become consistent, and how this consistency convergence changes with the load on the system. Typically it converges so quickly that the unit test doesn't have to retry. Rarely, it takes a second or two. Or sometimes, not so rarely. I was testing a lot of this code on the day after the Thanksgiving holiday in the United States, also known as Black Friday, typically the busiest shopping day of the Christmas season. It frequently took two or three seconds for the system to converge to a consistent state. I haven't seen this kind of latency before or since. But, to its credit, S3 always converged and the unit tests all eventually ran successfully to completion.
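To make that concrete, here is a sketch of the kind of measurement I mean. It is not real unit test code: it reuses the hypothetical settle() helper above and plain <ctime> for timing, where the actual tests use their own logging.

#include <ctime> // std::time, at the top of the test translation unit

// Sketch only: roughly how long did this HEAD take to converge?
ObjectHead head(put, multiplex);
std::time_t began = std::time(0);
bool converged = settle(head, multiplex);
long seconds = (long)(std::time(0) - began);
// Under light load this is typically zero; on Black Friday it was
// sometimes two or three seconds.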

Whether or not this is an issue for you depends on your S3 application. For a lot of applications, it won't matter. For applications where read-after-write consistency is important, AWS offers a higher (more expensive) tier of service in which objects are co-located in one of its big data centers, instead of possibly being spread out among multiple centers, and which does offer read-after-write consistency. If immediate consistency is an issue for you, S3 may not be the right platform for your application; but then I would argue that distributed computing may not be the right paradigm for you either.

The cloud is not the place for everyone.
