There is a ton of chatter on the Internet about Amazon SimpleDB, Apache CouchDB, Google App Engine’s Datastore API, and other distributed key-value data stores. Their biggest perceived advantage is scalability: they can help eliminate the bottleneck imposed by single-server databases.
But the hype around these new databases is growing frantic. This morning I read an article by Todd Hoff which fawned over SimpleDB’s unconventional rules to such an extent that I thought it might be satire. There are some significant drawbacks to developing in this new database paradigm. In fact, many of Mr. Hoff’s supposed advantages are actually serious disadvantages to the paradigm. Before designing your architecture around a database engine like SimpleDB, it’s important to consider the reasons not to do so.
Most of my points are directed at the Amazon SimpleDB service, but many also apply to other databases like CouchDB and the Google Datastore.
1. Data integrity is not guaranteed.
Data stores like SimpleDB don’t support the same rigorous constraints that RDBMSes do. Some of these databases support single-row constraints, like requiring data in certain fields, but it is nearly impossible for these systems to enforce UNIQUE constraints and foreign keys.
Programmers can work around this by issuing extra queries to confirm an update is valid, but this requires a lot of extra work. This will never be perfectly accurate–it may be impossible to avoid race conditions when two clients simultaneously attempt conflicting updates. And it’s especially difficult with SimpleDB because SimpleDB doesn’t guarantee that a client sees all the recent updates to the data.
2. Inconsistency will provide a terrible user experience.
Speaking of inconsistency, it’s critical to shield users from
this property of SimpleDB.
SimpleDB is optimized for fast writes. Your API calls return as soon as the data is written to the SimpleDB service, but before it’s replicated across all of the SimpleDB servers. If you issue any queries before the data is propagated, you won’t necessarily see your most recent change.
When I save my changes in your web application, I expect that your system will show me a consistent view of those changes. If you show me the data that’s in SimpleDB, my changes might not appear, and I’ll probably get confused. In fact, I will probably freak out, thinking that you lost my data. You can try to inform me about how this works (”It will take a few minutes for your changes to be visible…”) but that’s not easy for users to grasp.
3. Aggregate operations will require more coding.
SimpleDB does not support aggregate operations like joins, GROUP BY, SUM/AVERAGE functions, and sorting. You will need to implement these yourself.
Todd Hoff argues that this “suckiness” is a fair tradeoff:
SimpleDB shifts work out of the database and onto programmers which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I’ll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don’t even care how twisted and inelegant the code is because we can make it work. And as long as we can make it work we are happy.
I disagree. More boilerplate code distracts you from actually solving real users’ needs. Why reinvent the GROUP BY wheel when MySQL, PostgreSQL and Oracle have already perfected it?
4. Complicated reports, and ad hoc queries, will require a lot more coding.
In my experience, database use falls into three broad patterns: (1) standard queries and updates performed by your application’s users; (2) more complicated reports for users and internal staff; and (3) ad hoc queries for troubleshooting and system monitoring. SimpleDB may be optimized for category 1, but categories 2 and 3 will be much more difficult without SQL.
Complicated reports are probably the best application of the SQL language. Because SQL is a declarative language, it’s incredibly easy to generate aggregate information about your data. In my previous jobs, our reports often required hundreds of lines of SQL to get the right information out of the database. This is a lot of code, but it was required to generate the data for our customers. Without access to SQL, your programmers will need to implement reports through imperative statements, which will exponentially increase the development time.
Ad hoc queries are even worse: they’re usually simpler, but they’re always changing. An RDBMS expert can often write an ad hoc SQL query as fast as the marketing department can explain what they need. Using an imperative programming language to write these queries would destroy your developers’ productivity.
5. Aggregate operations will be much slower if you don’t use an RDBMS.
RDBMSes are highly optimized for performing aggregate operations across huge volumes of data. Fast algorithms like the hash join, merge join, and indexed binary search have been around for 20 years or more. SimpleDB and the Google Datastore return datasets which are more like objects than traditional database rows. It’s unlikely that you’ll be able to process this data with anything other than nested loops, especially if your programmers aren’t database algorithm experts. Nested loop algorithms are considerably slower than the others.
Even if you’re the 31337est database expert and enjoy writing these operations in your business objects, there’s another performance factor to consider. In order for your application server to handle aggregate operations, you will need a copy of all the relevant data on the application server. Rather than downloading a single SUM function result from the database, your application server will need to download all the data required to calculate the sum. This extra data transfer will add considerable latency when you’re dealing with thousands or millions of records.
6. Data import, export, and backup will be slow and difficult.
Oracle, MySQL and other RDBMSes include advanced tools to perform large-scale data import and export operations. These tools have also been refined for 20 years or so, and can process millions of rows per minute. There are no such tools for key-value data stores, because these products are so new.
When you’re processing millions of records, network latency makes a big impact. Most of these services perform a remote procedure call for each record inserted; some even limit you to querying one record per remote call. On the Internet, round-trip latency is usually 20-40ms, which may slow you down to fewer than 2,000 rows per minute. (You can process more quickly via multi-threading, but again, that requires you to write a lot more infrastructure code.)
7. SimpleDB isn’t that fast.
Todd Hoff’s article referenced a SimpleDB performance test which found that 10 record IDs could be retrieved in 141ms from a 1,000-record table; in 266ms from a 100,000-record table; and in 433ms from a 1,000,000-record table.
Compared to relational databases, this is pretty slow.
If you want your web application to be responsive, you need your database queries to operate much faster than this. 20ms responses would be more in line with conventional databases. If you perform 3 SimpleDB queries in series, your web app will take about 1.5 seconds for that operation, and users will notice when the app is that slow. Many web applications actually make dozens of queries per request.
Further, tables with a million records aren’t large enough to need significant scalability. A million-record table is probably small enough to fit entirely in RAM; surely its indexes could fit in RAM. The real test of SimpleDB scalability is its performance on a table with 100 million or 1 billion records.
8. Relational databases are scalable, even with massive data sets.
The world’s largest companies all use giant relational databases, and they’ve been able to make this work. The world’s largest websites use relational databases, and they’ve also been able to scale successfully. Facebook and LiveJournal use MySQL; MySpace uses Microsoft SQL Server; Salesforce.com uses Oracle. When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.
We all expect Oracle to scale if we pay them enough money, but even free databases have made significant advances to prevent the database server from becoming a bottleneck. The first line of defense is caching–eliminating repetitive queries can offload massive amount of processing. Beyond caching, there are free clustering engines which let you balance your database requests around a few servers in a cluster.
Without a complicated clustering setup, your data can usually be partitioned across multiple servers to eliminate the single-server bottleneck. Lest you think I’m ragging on Todd Hoff, he’s written a nice overview of sharding, one way of designing a federated database to get around the bottleneck.
9. Super-scalability is overrated. Slowing the pace of your product development is even worse.
Time-to-market is a critical factor for most software products. If you’re writing internal software for a business, budgetary concerns are equally critical. You can workaround most of the drawbacks I’ve identified above, but it will cost you time and money.
More importantly, all these technical workarounds distract you from addressing the real needs of your customers. If you don’t focus on making something people want, it doesn’t matter how scalable your database is, because you won’t have any customers to fill up the database.
The hype around the new data stores seems to be a case of premature optimization, yet we all know Donald Knuth’s famous quote, “Premature optimization is the root of all evil.” Why not wait and address super-scalability once you’ve created a super product and have generated super cash flow?
10. SimpleDB is useful, but only in certain contexts.
Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers. I don’t think it was ever intended for that purpose. SimpleDB’s architecture is similar to Dynamo, Amazon’s internal “highly-available key-value store.” One of its main distinguishing features is the flexible schema: the ability to add custom fields to individual records, and to store multiple values in each field.
If you’re working with “semi-structured” data, then this is actually incredibly useful. For example, it’s an awesome way to persist web application sessions. You can avoid the overhead of marshaling the object-oriented session data into columns and rows, and many of the drawbacks above don’t apply because you don’t generally query sessions like you query more typical relational data.
Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’t bad products. But we do them a disservice when we construe them to be replacements for general-purpose databases. Used carefully, they can help your organization. But used indiscriminately, you’ll create a lot more work for your programmers and you’ll make your application perform even worse.


33 comments ↓
You nailed it.
[...] earth-shattering news, Ryan Park has revealed that CouchDB is not a relational database, and therefore may not behave like a relational database. [...]
I would suspect that the scalability solutions for e.g. MySQL are for mostly-read scenarios. I don’t think anybody can do linear horizontal scaling of heavy-write scenarios. Except Real Application Clusters.
That said, Oracle does have an in-memory, key-value pair based system which is highly (5000-node clusters in production) and linearly scalable, can do aggregations over the entire grid, and works on objects.
Oracle Coherence.
Costs an arm and a leg, but solves all the issues raised in this article (yes you can do aggregations and SQL-like queries, and they are automatically run “in parallel” across the entire grid).
We heard these similiar things during MySQL vs RDBMS saga, and why soon the sky will fall down on everyone using MySQL instead of a read database. Still MySQL is going stronger than ever. There are a lot of FUD on the net as always. What’s missing is a few real life true story on how someone lost business because they chose to use a non RDBMS solution where RDBMS prefered that they should be doing otherwise.
>This morning I read an article by Todd Hoff which fawned over SimpleDB’s unconventional rules to such an extent that I thought it might be satire.
You don’t miss much, do you, slick?
So funny:
“We all expect Oracle to scale if we pay them enough money…”
Excellent points, all 10 of them! Thanks for the write up.
> I’ll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. (By Mr.Todd Hoff)
I also strongly disagree with this. Yes, programmers do like solving problems but only new problems and challenges not those problems for which the easy solutions already exist.
I thought that the scalability proposed by this model is not in the “number of records in the table” but in the “number of simultaneous reads”. As in having a lot of simultaneous users searching for an item, not as in having a couple of users searching through a lot of items.
That said, I would’ve just stated one reason not to use this (#10: You almost surely don’t need it).
How can you support shards while not support simpledb?
Surely you can’t do a group by if you use shards?
MapReduce and parallel execution solves a fair number of the arguments above.
Using a Real Database (c)(tm) will solve ALL of the above problems.
The polished turds from Amazon and Google are still turds, though shiny.
[...] Top 10 Reasons to Avoid the SimpleDB Hype Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’t bad products. But we do them a disservice when we construe them to be replacements for general-purpose databases. (tags: simpledb) [...]
I love how you rdbms fanboys are scared out of your pants by the new wave of stuff. I love it.
Here’s why you /should/ be scared. Google (and to a lesser degree Amazon) /already/ run on this new breed of DB. And their apps /work/. And their apps /scale/, massively. No stupid sharding or rdbms babysitting required.
Sorry, but if you think traditional rdbms scale without problems, you either have 0 experience with large systems or you are being disengenuous.
ha, and one more thing. Once you shard (as all the big-boy rdbms shops eventually have to do, e.g. youtube), most of your arguments go away, too: you no longer get automatic integrity, consistency, aggregate ops, etc. Once you shard, YOU HAVE TO DO A LOT OF CODING to make up for the INHERENT NON-SCALABILITY OF RDBMs.
Sorry, you can’t get around it. MUCH BETTER to go in with the assumption that you’ll have these problems (because, hey, you are creating something that’s going to be successful, right?) and plan from day 1 to deal with them.
That’s one of the beauties of couch/simple/bigtable — you can’t hide behind some empty promise from some big RDBM vendor. You have to face the truth from the start.
And you know what? The truth isn’t so bad. It’s quite elegant, actually.
You seem to have missed the point of these pieces of software. By structuring your application to use these new datastores, you don’t need to worry about fork-lift upgrades and all the other scaling problems of a traditional RDBMS as you get bigger and take on more load. It scales up just like the rest of you app: by adding more machines.
Yahoo!, eBay, Facebook, etc scale their RDBMSs by doing the same thing that SimpleDB or BigTable do internally: by sharding the data down to finer and finer levels of keyspace as the number of machines grows. Except, with an RDBMS, this is a manual process. They also use read-only slaves to distribute read load, something also implicit in BigTable/SimpleDB/etc with their use of simple block replication without the concurrent use of erasure coding (e.g. RAID). RAID is also external to the RDBMS, requiring you to manage both disparately.
Also, Oracle can scale up to 64 nodes at max with a clustered filesystem (this is somewhat old, it might be 128 by now). Google has ~650,000 machines in their clusters. This is 4 orders of magnitude difference. No one has enough money to pay Oracle to scale to this level. Yahoo! and Facebook have gotten MySQL to run on more boxes than this, but not in a cluster, so they (like you and everyone else) are stuck with the manual process of sharding and shard management.
If you don’t expect to grow, by all means, continue to live in the RDBMS past. However, if you’re app is subjected to possible rapid growth (e.g. a Facebook app, Salesforce app, GAE app, pretty much any Web-facing app) you should definitely be thinking about how to leverage SimpleDB/HBase/Hypertable/CouchDB/etc in your design. I see a lot of posts of this nature lately and they all seem to be coming from the initial shock of what you *can’t* do with these systems. Give them a shot and see what you *CAN* do with some semi-clever design and you might be surprised.
Good entry, Ryan, and quite on mark. I had just come across a quotation of the article you quote in #3, and at the time I thought he was being sarcastic. I’m greatly saddened to think that he was actually being serious.
I find it most interesting seeing all of the cheerleading for SimpleDB and similes by people who are quite evidently clueless about databases, so they embrace and flaunt their ignorance, using Google and Amazon as a “Big Daddy” of sorts, always ready to reference.
Guess what, kids - you aren’t Google, Amazon, or Facebook. The chance that your web toy will ever be a fraction as popular as those sites is so vanishingly small that is creating an underflow condition.
Google has a very specialized database, and their needs are absolutely nothing like almost anyone else. Amazon likewise. Until the day that you build your own specialized database, an RDBMS is often a suitable choice.
And the scalability ruse….extraordinary. The numbers I’ve seen for these “scalable” database technology are need to be scalable because they’re such incredibly poor performers.
Alas, everything old is new again. Here we have cheerleaders heralding the arrival of basically exactly what people did before real databases were invented. Hurrah for the past!
“Guess what, kids - you aren’t Google, Amazon, or Facebook. The chance that your web toy will ever be a fraction as popular as those sites is so vanishingly small that is creating an underflow condition.”
That’s exactly right. If you are creating a toy app, you don’t need to scale. If you are creating a toy app, you should use a toy db, i.e., an rdbms.
Google is not the only one that needs a scalable db. /ANYONE/ who ever hopes to have > 100,000 users is going to start running into scalability problems and eventually face the reality of the SHARDING NIGHTMARE if they use an RDBMs.
And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?
Okay, it seems to have mangled my last post, so let me format slightly…
+And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?
Ho ho ho. Awesome stuff.
Yeah, I guess making systems managing billions in funds just doesn’t cut into realm of the awesome systems that you make.
You are simply delusional.
+And if you are making something for 100,000 user sites do you have, jackson? Care to point a couple out?
Now I presume you must mean 100,000 simultaneous users, because there are quite a few >100K user sites easily running on some shitty RDBMS (e.g. MySQL) on a low-end desktop PC. Slashdot, for instance, which was pretty much a worst case because they were caching nothing, and generating every request live from the database.
Clearly you have needs far beyond /. in their heyday.
Better still, you have needs beyond Slashdot in their heyday, and an apparently miniscule budget. My dev database server is a 16-core, 6-disk monster, serving up an unbelievable transaction load.
Not good enough for jackson’s imaginary success story, though.
@jackson - “And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?”
the answer you troll for is “Nope”. in fact, i’d say just the opposite. if you know before you start that your app will need upwards of a hundred thousands users to be useful to its audience, “you really probably [sic] ought to just stop now.”
You definitely have some good points — but you’re also being unfairly harsh and not scoring any higher marks on presenting a balanced viewpoint.
One obvious one that caught my eye: #7: “SimpleDB isn’t that fast” — Todd specifically pointed out (right in there with the performance numbers he was quoting…) that tools like SimpleDb are NOT fast. That’s not the point; they exist to address scaling issues.
Some other lines you apparently considered fawning or possibly satiric:
“If you have a complex OLAP style database SimpleDB is not for you. But, if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger ever again, then SimpleDB makes sense. The cost is everything you currently know about using databases is useless and all the cool things we take for granted that a database does, SimpleDB does not do.”
That sounds an awful lot like what you’re saying in #10. But you start that off with “Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers.”
Sorry for the rant; I guess I’m just saying you clearly have some useful input to add to the discussion — just leave the straw man nonsense at home, please.
Can someone remind me why application programmers should care so much about the integrity constraint checking afforded by relational databases? There are only a small set of constraints that can be checked without programming. And, from what I’ve seen, the only way to get decent error reporting in my application is to check all of the constraints myself, anyway.
Last time I checked, Amazon *do not* use SimpleDB to power their online store. Rumours of Postgres abound …
PS - do you know that your comment filter rejects valid email addresses such as root@localhost.localdomain? (at least, that’s where all my cron jobs send it …
“When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.”
@toby has it exactly right. RDBMS _can_ scale, but at *significant* costs in both money and developer/sysadmin/DBA time.
[...] 24th, 2008 in Links Ten reasons SimpleDB is over-hyped. I’ve been arguing some of the same points myself: losing the expressiveness of SQL [...]
The hype has been pretty intense and peoples’ perceptions of what SimpleDB et al are useful for have grown pretty inflated. Most applications database component are made up of 3 distinct areas; 1 user related data, 2 data that structures the user experience within the app, 3 rapidly growing and dynamic data. I feel that 1 and 2 are best served by an RDBMS while 3 is a good fit for SimpleDB. Take YouTube. User data and user created scalar data is tiny compared with Video data and its associated data. If a video search is not fully optimised, no-one dies. However, a user does want to be sure that their favourites, channelss, etc are stable.
Regards
D
First of all I’d like to note that the below comments are not about SimpleDB but rather to prevent FUD about document-based databases.
1. Data integrity is not guaranteed.
This could be the case with SimpleDB, but overall nothing prevents document databases from managing data integrity very well.
Regarding the constraints, there is nothing that prevents defining validations in a document or its related “meta” document (this is pretty much how StrokeDB works — you can define your validations within meta document and they will let your document stay validated)
More interesting are the concerns about the conflicts. I’d say that this problem is hardly addressed in a common RDBMS approach. All you usually get is either user’s A or user’s B most recent update — there seems to be no easy way graceful conflict resulution. On the contrary, since document databases approach is rather novel there is certainly enough room to adopt ways to deal with conflicts. For example, with different and configurable algorithms — like merging them slot-by-slot 3-ways, or even some special programmer-defined algorithms. I can hardly imagine how to do this sort of stuff with traditional RDBMS in a relatively easy manner.
2. Inconsistency will provide a terrible user experience.
First of all, it should noted that described inconsistencies are also quite possible with distributed RDBMS setups — they too are constrained by a certain lag before the data is going to be propagated through replicas.
The actual problem is not with lag — it is more about leaving documents in a consistent state.
This problem could be easily addressed in any kind of database, either relational or document-based.
3. Aggregate operations will require more coding.
Again, while this seems to be true for SimpleDB, other document-based databases address this problem pretty well with Views approach (CouchDB, StrokeDB [Views is WIP]) — so you can define any kind of aggregation, even such that are simply not supported by RDBMS.
More at http://rashkovskii.com/articles/2008/4/26/top-10-reasons-to-avoid-document-databases-fud
[...] Top 10 Reasons to Avoid the SimpleDB Hype — ryanpark.org (tags: http://www.ryanpark.org 2008 mes3 dia25 at_home SimpleDB blog_post) [...]
You are missing the point completely.
Databases != RDBMS. RDBMS is but “one” kind of database. Then you have hierarchical, object-based, document-based, etc.
SimpleDB is but one kind of non-RDBMS database. There are use cases that fit RDBMS, that are use cases that make RDBMS cry. That’s where SimpleDB or other alternatives get into the game.
Just as simple as that. When all you know is a hammer, all your problems are nails.
This article is just wanting more traffic by generating FUD to newbies.
I couldn’t agree more with AkitaOnRails, you are comparing totally wrong stuff, this shows in a lot of situations a lack of experience.
The points in this article are true, but not reasonable or relevant, and more importantly highly imbalanced. Hence I am sorry to say -> FUD.
Google scales because it can afford the SKUs for storing their data and hence can throw machines at resolving scaling problems. However, if you need to run Oracle or any other RDBMS, you will have to empty your wallets to scale up. Most online applications do not require the zillion RDBMS features that are not optimized for the characteristics of typical online apps which are more read heavy. I should also point out that I have unfortunately seen people ridiculously normalize their schemas even for read heavy apps when they could have easily spent more on writing multiple times.
Leave a Comment