
Eventual consistency is everywhere in the real world

11.08.2011

Our habit, at least in the 2000s, was to build web sites and web applications backed by a relational database. In my case these web products were built with PHP, a language that does not usually spawn resident processes on the server, but consists of scripts started on demand.

This kind of storage and language gave me (and a few million other LAMP developers) a bias towards the immediately consistent model: a transaction is performed during each HTTP request, and at the end of it everything that should have been changed is consistently in the new state; every row in every table of the database is up to date.
Today we're seeing a transition towards NoSQL databases, often based on eventual consistency, and towards message queues, event-driven programming and asynchronous processing. Have you heard of Gearman?
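
To make the asynchronous-processing part concrete, here is a minimal PHP sketch of handing work to a Gearman job server. It assumes the pecl gearman extension and a job server on localhost:4730; the send_welcome_email job name and the lookupAddress() helper are made up for illustration.

<?php
// In the script serving the HTTP request: enqueue the job and return
// immediately, instead of doing the slow work inside the request.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('send_welcome_email', json_encode(array('user_id' => 42)));

<?php
// In a separate, long-running worker process: pick up jobs as they
// arrive. The email is sent eventually, not within the HTTP request.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('send_welcome_email', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    mail(lookupAddress($data['user_id']), 'Welcome!', 'Thanks for signing up.');
});
while ($worker->work());

The user never waits for the mail server: the queue absorbs the latency, and the rest of the system catches up eventually.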

The concrete examples that follow are taken both from software development and from the real world: I hope they aid you in adding the new paradigm to your toolbox (without necessarily disregarding the old one) and in evangelizing it to other people.

What's reality now?

Although many architectural choices try to hide this fact, our applications can be built as distributed systems running on different machines. The days of defaulting to a MySQL connection to localhost, or to a machine on the intranet, are over.

Our data can be kept on different machines and in different projects, owned by different people (ever integrated anything with Facebook?). Eventual consistency is the only way to scale in these scenarios; and by scale I don't mean supporting a million users on a single CPU, but just building a useful application that does not have to wait 60 seconds for each page to load.

What was the world about?

Once upon a time, everything worked on paper (at least until a few decades ago); immediate consistency is an invention of IT people. When paper sheets had to travel between organizations and between different places, there was no requirement for immediate consistency, as it wasn't even possible.

Sometimes immediate consistency is an improvement that makes a startup's fortune: Dropbox and Google Docs eliminate the pain of synchronization and of multiple versions of files and documents in a beautiful way.

Sometimes the business doesn't really care much (or at all) about forcing consistency over a large set of data.

Amazon recommendations

Amazon has a nice system for increasing sales: presenting recommendations on each product's page, based on which products Amazon's users have bought in the past.

The sets of bought-together products behind those recommendations are usually built over the whole users/products incidence matrix. Updating such a model immediately would be impossible: every time a product was bought, the page would lock up.

So how does Amazon scale to selling hundreds of products continuously while generating the pages containing recommendations at the same time? In the simplest way: with a little layer of caching.

In fact, almost every cache is based on eventual consistency. I would exclude the cases where cache invalidation is implemented deterministically; in all other cases, caches are the most widespread examples of eventual consistency. The difference between them and NoSQL consistency is that many caches expire after a finite amount of time. You can cache an HTML page for 10 seconds, while the content of a CouchDB view will always favor availability over consistency: even when the results are very old, querying a CouchDB view never blocks waiting for recomputation.
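
As a minimal sketch of this kind of expiring cache in PHP (assuming a local Memcached instance; renderRecommendations() stands in for the expensive walk over the recommendation model):

<?php
// Serve a cached recommendations fragment when present; otherwise
// recompute it and let everyone reuse it for the next 10 seconds.
$cache = new Memcached();
$cache->addServer('localhost', 11211);

function recommendationsFor($productId, Memcached $cache)
{
    $key = 'recommendations_' . $productId;
    $html = $cache->get($key);
    if ($html === false) {
        $html = renderRecommendations($productId); // hypothetical, expensive
        $cache->set($key, $html, 10); // expire after 10 seconds
    }
    return $html;
}

Visitors may see a fragment that is up to 10 seconds stale, but nobody ever blocks on recomputing the model: availability wins over consistency.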

Google search results

The same consistency model applies to Google (and any search engine): crawling takes a finite amount of time, so search results are always at least a little inconsistent with the current state of the web. It couldn't be otherwise.

However, there is another level of eventual consistency. Although the indexes are probably updated hourly in 2011, spreading updates across data centers used to take quite a while, in a phenomenon called the Google dance.

During a Google dance, each data center would return different results and different orderings in SERPs while the PageRank of each result was being updated. Yet Google didn't suffer from this occasional lack of consistency, and went on to become a web superpower.

Dropbox

According to the CAP theorem, we can only achieve two properties in a distributed system, chosen among Consistency, Availability and Partition tolerance.

Although Dropbox enables immediate consistency via synchronization in many cases, it must work with partition tolerance: it's a distributed system spanning your devices (PCs, tablets, smartphones) which you can use even without a connection. You can work on files in Dropbox's local folder while commuting, in a tunnel, at sea, in the mountains where 3G is not available, or when you want to save battery or bandwidth for later.

Thus Dropbox promises to give you a folder that syncs. It sweeps lots of complexity under the carpet, like its delta-compression algorithms and its management of conflicts, or its syncing over the LAN whenever possible and over the Internet in every other case.

Dropbox embraces eventual consistency: transactional consistency would be impossible to obtain over a large network like the Internet, as your word processor would freeze each time you hit Ctrl+S while the updates were pushed to your other devices. Immediate consistency is also impossible to obtain during a network partition, such as the absence of a working connection.

The final result? You work on your document, which will be updated when the connection is available, and then updated on your other devices when their connection is available. Sometimes it takes a while to sync, if you have large new files to upload. But you can keep working, although not from multiple devices at the same time (Dropbox is oriented towards personal syncing, not collaboration, so this is not a real limitation.)

The hottest startup of the moment, built on explicitly avoiding immediate consistency and getting away with it.
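
A minimal PHP sketch of this "write locally, propagate later" idea (not Dropbox's actual protocol; pushToServer() stands in for whatever transport the real client uses):

<?php
// Local writes always succeed; propagation happens whenever a
// connection is available. This is the availability/partition
// tolerance trade-off in miniature.
class SyncQueue
{
    private $pending = array();

    public function recordChange($path, $contents)
    {
        // Saving locally never blocks on the network.
        $this->pending[] = array('path' => $path, 'contents' => $contents);
    }

    public function flush($isOnline)
    {
        if (!$isOnline) {
            return; // partition: keep accumulating local changes
        }
        while ($change = array_shift($this->pending)) {
            pushToServer($change); // hypothetical remote call
        }
    }
}

Conflict handling and delta compression would sit on top of this, but the basic contract is the same: the local copy is authoritative now, and the other replicas become consistent eventually.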

Banking and money

It still takes some days to transfer money between Italian banks. You could imagine vans full of bank notes and policemen driving from one bank to another; actually, today banks communicate over a digital network, but the protocols still work as before. Wire transfers are scheduled for the end of the day, and transfers are made off-hours (probably to simplify the accounting rules).

Bank transfers are always chosen as the example of a database transaction. That scenario may hold only when both accounts are at the same bank: it's not uncommon to transfer money between San Paolo and ING Direct and actually see the money vanish for some days, subtracted from one account but not yet added to the other (someone with more knowledge of the banking domain could clarify this scenario.)

Of course banks keep every kind of log so that money never really disappears, but if you observe the external state of the system, at first glance you are poorer while a wire transfer is taking place.
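
A deliberately naive PHP sketch of that observable behavior (assuming a PDO connection with accounts and pending_transfers tables; this illustrates the idea, not how any real bank's systems work):

<?php
// The debit is immediate; the credit is only journaled and applied
// later by a batch job, so the money seems to vanish in between.
function requestTransfer(PDO $db, $from, $to, $cents)
{
    $db->beginTransaction();
    $db->prepare('UPDATE accounts SET balance = balance - ? WHERE id = ?')
       ->execute(array($cents, $from));
    $db->prepare('INSERT INTO pending_transfers (from_id, to_id, cents) VALUES (?, ?, ?)')
       ->execute(array($from, $to, $cents));
    $db->commit();
}

// Run off-hours, e.g. at the end of the day: the system becomes
// consistent eventually, and the journal ensures nothing is lost.
function applyPendingTransfers(PDO $db)
{
    foreach ($db->query('SELECT id, to_id, cents FROM pending_transfers') as $row) {
        $db->beginTransaction();
        $db->prepare('UPDATE accounts SET balance = balance + ? WHERE id = ?')
           ->execute(array($row['cents'], $row['to_id']));
        $db->prepare('DELETE FROM pending_transfers WHERE id = ?')
           ->execute(array($row['id']));
        $db->commit();
    }
}

Between the two steps an external observer is "poorer", exactly as with the real wire transfer; the journal is what keeps the money from actually disappearing.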

Again, sometimes immediate consistency is an innovation: PayPal transfers money all over the world in seconds, allowing a buyer to check out on an e-commerce site instantaneously.

However, vendors accepting PayPal will often tell you "We have accepted your order, now wait for us to get back to you..." while they look for the physical item in their warehouses. Online offers are not always consistent with the actual inventory...

Published at DZone with permission of Giorgio Sironi, author and DZone MVB.


Comments

Tomas Dermisek replied on Wed, 2011/11/09 - 6:07am

Hi Giorgio, I have just recently listened to a very nice podcast about this topic that goes into more detail, in case someone is interested: http://www.se-radio.net/2010/07/episode-165-nosql-and-mongodb-with-dwight-merriman/ Regards, Tomas

Giorgio Sironi replied on Wed, 2011/11/09 - 4:21pm in response to: Tomas Dermisek

It will go into my playlist, thanks. :)

Nicolas Bousquet replied on Mon, 2011/11/14 - 6:35am

By default the real world is not consistent, I agree. We communicate, work and behave using partial information, using past state.

But, like you said for bank transactions, and as is also true for e-commerce and many other fields, you have to manage this eventual consistency to make it appear consistent.

If a customer has to wait 3 weeks instead of 3 days to receive his purchase because you failed to keep a consistent inventory and provide a reliable shipping estimate, you just lost a customer. Because if the gift he wanted for his wife comes 2 weeks after her birthday, you simply failed. NoSQL or not. Eventual consistency or not.

RDBMS databases provide this abstraction for you. Hardware is unreliable. Consistent replication is complex. Concurrent read/write with ACID is quite complex. But an RDBMS manages it really well. And it is fast: a typical transaction requires only a few ms, you can have thousands per second, and you can manage TB of data and still have it fast.

Yes, NoSQL databases are typically faster (if you don't need to replicate ACID by yourself on top of them), and yes, they scale better (although Oracle RAC scales up to a hundred servers, not too bad if you ask me). And there are cases when you want to use them.

But like you said with caches, this is not new. It has always been like that: we have many caching layers, we compute some data offline, email notifications are put on a queue... And that's OK. It was already the case 30 years ago.

The key is to choose the right tool for the job. An RDBMS is simply a productivity boost when you need consistency. It is also a very mature solution that will serve you well in most cases, because not everybody needs to deal with hundreds of millions of users anyway.

But when you need a caching mechanism, or when you need absolute performance at very low cost, NoSQL will give you that... at the expense of development time and managing the complexity yourself.

Giorgio Sironi replied on Sun, 2011/11/20 - 9:27am in response to: Nicolas Bousquet

Thanks for adding other business-related examples. I think in the case of a birthday gift, the freedom we get by abandoning immediate consistency is that we can ensure client requests are satisfied at the day level instead of at the millisecond level. The form of eventual consistency I seek decouples the propagation of data from the first storage to the others (or to its views), while still tuning the system to ensure that, for example, a purchase request is fulfilled in less than a day.
