Massive scalability is a key component of elasticity, which in turn is the key advantage of cloud computing. Handling massive amounts of data is far from easy whether you use cloud computing or not. To get the real benefits of the cloud there are a couple of limiting factors that need to be considered, at least that is the way the official dogma goes.
We cannot “have-it-all” with big data
Many seasoned developers/architects are used to working with, or even designing, databases that offer perfect consistency and very good availability. Sadly, this is a more challenging task with big data.
Spreading out the data on multiple machines is a good way to improve availability. It enables your solution to serve more requests per time unit and also gives you the opportunity to implement automatic failover.
However, improving availability this way impacts consistency. If we save data to machine A and that machine fails immediately, before the data has been copied to machine B, machine B will take over so that our availability remains high. The consequence is that the data we just saved to machine A will not be reflected on machine B, i.e. our consistency will be less than perfect.
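To make the failover scenario concrete, here is a minimal sketch in Python. It is a toy in-memory model and not any vendor's implementation; the names `Machine` and `AsyncReplicatedStore` are made up for illustration, and it assumes that replication happens only after the write has been acknowledged.

```python
class Machine:
    """A toy storage machine that keeps key/value data in memory."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class AsyncReplicatedStore:
    """Hypothetical active/standby pair that replicates after the write returns."""

    def __init__(self):
        self.active = Machine("A")
        self.standby = Machine("B")
        self.pending = []  # acknowledged writes not yet copied to the standby

    def write(self, key, value):
        # The write is acknowledged as soon as the active machine has it.
        self.active.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Copy all outstanding writes to the standby machine.
        for key, value in self.pending:
            self.standby.data[key] = value
        self.pending.clear()

    def fail_over(self):
        # The active machine is lost and the standby takes over.
        # Writes that were never replicated disappear with the failed machine.
        self.active, self.standby = self.standby, Machine("replacement")
        self.pending.clear()

    def read(self, key):
        return self.active.data.get(key)


store = AsyncReplicatedStore()
store.write("balance", 100)
store.replicate()             # machine B now also holds 100
store.write("balance", 250)   # acknowledged by machine A only
store.fail_over()             # machine A fails before replicating
print(store.read("balance"))  # prints 100: the service is up, but the last write is gone
```

The service keeps answering after the failure, which is the availability gain, but it answers with the older value, which is the consistency cost.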
According to the CAP theorem (I will not explain that here) we have to prioritize between availability and consistency. This is a tough choice since we generally want both.
What kind of choices are available?
The choices that we do have are to forfeit consistency or to forfeit availability. This is not as dramatic as it sounds, since we would still have good consistency and good availability! However, sometimes consistency is of utmost importance: it is required that all data is completely consistent at all times on all machines. In such a scenario we have to accept reduced availability. On the other hand, if availability is too important to reduce, we can choose to reduce the data consistency. Here are some terms that are important to understand before reading the rest of this blog post:
Eventual consistency is a term that was popularized by Werner Vogels. Data is saved and committed to some but not all machines (often called nodes) before your write request returns. The advantages include that only some of the machines must be available when data is saved and that some time may be saved due to contacting fewer machines before the saving operation returns. However, this leaves a window of inconsistency, meaning that during a period of time requests might be served using old data. Working on reducing the inconsistency window so that it is closed before the next expected request arrives is a recommended strategy.
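The inconsistency window can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the class `EventuallyConsistentStore` and its `ack_count` parameter are hypothetical, the nodes are plain dictionaries, and propagation is triggered by hand instead of running in the background.

```python
class EventuallyConsistentStore:
    """Toy store that acknowledges a write after only some nodes have it."""

    def __init__(self, node_count=3, ack_count=1):
        self.nodes = [{} for _ in range(node_count)]
        self.ack_count = ack_count  # nodes updated before write() returns
        self.backlog = []           # (node_index, key, value) still to be applied

    def write(self, key, value):
        for index, node in enumerate(self.nodes):
            if index < self.ack_count:
                node[key] = value  # committed before the write returns
            else:
                self.backlog.append((index, key, value))  # applied later

    def read(self, key, node_index):
        # A read is served by a single node, which may be behind.
        return self.nodes[node_index].get(key)

    def propagate(self):
        # Apply every outstanding update; this closes the inconsistency window.
        for index, key, value in self.backlog:
            self.nodes[index][key] = value
        self.backlog.clear()


store = EventuallyConsistentStore(node_count=3, ack_count=1)
store.write("price", 10)
store.propagate()
store.write("price", 12)
print(store.read("price", node_index=0))  # prints 12: this node took the write
print(store.read("price", node_index=2))  # prints 10: stale, inside the window
store.propagate()
print(store.read("price", node_index=2))  # prints 12: the window has closed
```

In this model, the time between `write` returning and `propagate` finishing is the inconsistency window.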
Read your writes means that if a process writes data it will always have access to that updated data and will not have to deal with older values. (This is only interesting to discuss when dealing with eventual consistency.)
Monotonic reads means that once a process has accessed a data entry it will never be presented with an older version of that data on a later occasion. (This is only interesting to discuss when dealing with eventual consistency.)
The big cloud vendors Amazon, Microsoft and Google all offer data stores suitable for building cloud solutions with huge amounts of data. Amazon and Google offer both eventually consistent reads and strongly consistent reads in their products, but Microsoft does not offer any such options. In Microsoft Azure, consistent reads are the only option.
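Both guarantees can be expressed as simple checks on what a process observes. The sketch below assumes that every stored value carries a version number that increases with each write; that numbering and the function names are assumptions made for illustration, not something the data stores discussed here expose in this form.

```python
def violates_read_your_writes(written_version, read_version):
    """True if a process reads something older than what it wrote itself."""
    return read_version < written_version


def violates_monotonic_reads(read_versions):
    """True if any read in the sequence is older than an earlier read."""
    newest_seen = None
    for version in read_versions:
        if newest_seen is not None and version < newest_seen:
            return True
        newest_seen = version
    return False


print(violates_read_your_writes(written_version=3, read_version=2))  # prints True
print(violates_monotonic_reads([1, 2, 2, 3]))                        # prints False
print(violates_monotonic_reads([1, 3, 2]))                           # prints True
```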
Do these choices make any difference?
A very interesting study on data consistency in the cloud that compares these cloud vendors was published earlier this year (2011) by a bunch of researchers based in Australia. Their findings are summarized in the table below.
|Vendor||Product||Option||Consistent after||Read your writes||Monotonic read|
|Amazon||SimpleDB||Eventually consistent read||500 ms||No||No|
|Amazon||SimpleDB||Consistent read||0 ms||Yes||Yes|
|Amazon||S3||Reduced redundancy||0 ms||Yes||Yes|
|Amazon||S3||Standard redundancy||0 ms||Yes||Yes|
|Microsoft||Azure Table||(no option available)||0 ms||Yes||Yes|
|Microsoft||Azure Blob||(no option available)||0 ms||Yes||Yes|
|Google||App Engine Data Store||Strong consistent read||0 ms||Yes||Yes|
|Google||App Engine Data Store||Eventual consistent read||0 ms*||Yes*||?*|
Interestingly, the results show that during these tests only one option (Amazon’s SimpleDB with Eventually consistent read) gave rise to situations where the reader of the data saw any effects of eventual consistency. Another interesting finding is that SimpleDB was slightly faster with the Consistent read option than with the Eventually consistent read option, contrary to one of the hoped-for benefits of choosing eventual consistency.
As for the Google App Engine Data Store using the Eventual consistent read option, the results presented in the table above are marked with an asterisk (*) and here’s why: 11 out of 3,311,081 read operations returned stale data when reader and writer were not running in the same application. This consistency level is very high for an eventual consistency option. The explanation for these results might be that data is fetched from a secondary replica only if the primary one is unavailable. Since stale values were only returned when reader and writer were running in different applications, Read your writes consistency seems to be offered.
Based on the findings in this research, this is what I recommend you do when you are working with massive amounts of data in the cloud:
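To put the 11 stale reads in proportion, a few lines of arithmetic help. The two input numbers come from the study as quoted above; the variable names are just for illustration.

```python
stale_reads = 11
total_reads = 3_311_081

stale_fraction = stale_reads / total_reads
stale_per_million = stale_fraction * 1_000_000
reads_per_stale = total_reads / stale_reads

print(f"{stale_per_million:.2f} stale reads per million")  # prints 3.32 stale reads per million
print(f"about 1 stale read in {reads_per_stale:,.0f}")      # prints about 1 stale read in 301,007
```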
- Express requirements regarding availability and consistency in business terms
- Carefully consider your availability and consistency options with your business needs and implementation costs in mind
- Perform tests with realistic machine configurations, realistic amounts of traffic and realistic amounts & structure of data
- Evaluate and choose your implementation strategy
- Keep track of how your vendor changes their implementation and what that means for your solution
Expressing your consistency and availability needs in business terms is essential if you want to arrive at a decent solution. If you evaluate your options without being able to weigh the positive business outcomes of increased availability against the added development costs that may come with eventual consistency, you might be led astray. Thinking only about perfect consistency might also lead you astray: although perfect consistency might be good for your business, it has to be compared to the business outcomes of reduced availability. Your own tests (and other tests, e.g. benchmark tests) may also help you reason about how much performance and availability may be affected by your different options.
The last point (no. 5) is very important to remember when you are using a cloud-based solution, since most vendors change their hardware configuration, software configuration and software implementation from time to time. When cloud vendors make these changes you may be presented with new options, but some of these changes might just be carried out without giving you any new options: you simply get a “better service”, meaning that you may have to go back to testing again. At the end of the day, re-running your tests on a regular basis might be your best option. That way you do not have to worry about missing out on important vendor updates.
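A test that you re-run regularly does not have to be big. The following sketch measures how long it takes before a written value becomes visible to a reader. The function `measure_inconsistency_window` and the stand-in class `DelayedStore` are hypothetical; in a real test you would pass in the write and read calls of your vendor's client library and run the probe under realistic load.

```python
import time


def measure_inconsistency_window(write, read, key, value,
                                 timeout=2.0, poll_interval=0.01):
    """Write a value, then poll until it is readable.

    Returns the elapsed seconds after the write returned,
    or None if the value did not show up within the timeout.
    """
    write(key, value)
    start = time.monotonic()
    deadline = start + timeout
    while True:
        if read(key) == value:
            return time.monotonic() - start
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


class DelayedStore:
    """Stand-in for a real data store: writes become visible after a delay."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.visible = {}
        self.pending = {}  # key -> (value, time at which it becomes visible)

    def write(self, key, value):
        self.pending[key] = (value, time.monotonic() + self.delay_seconds)

    def read(self, key):
        if key in self.pending:
            value, visible_at = self.pending[key]
            if time.monotonic() >= visible_at:
                self.visible[key] = value
                del self.pending[key]
        return self.visible.get(key)


store = DelayedStore(delay_seconds=0.05)
window = measure_inconsistency_window(store.write, store.read, "probe", "v1")
print(f"value became visible after {window:.3f} s")
```

Logging the measured window on every run gives you a baseline, so a change in vendor behaviour shows up as a change in your own numbers.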