There is a ton of chatter on the Internet about Amazon SimpleDB, Apache CouchDB, Google App Engine’s Datastore API, and other distributed key-value data stores. Their biggest perceived advantage is scalability: they can help eliminate the bottleneck imposed by single-server databases.
But the hype around these new databases is growing frantic. This morning I read an article by Todd Hoff which fawned over SimpleDB’s unconventional rules to such an extent that I thought it might be satire. There are some significant drawbacks to developing in this new database paradigm. In fact, many of Mr. Hoff’s supposed advantages are actually serious disadvantages to the paradigm. Before designing your architecture around a database engine like SimpleDB, it’s important to consider the reasons not to do so.
Most of my points are directed at the Amazon SimpleDB service, but many also apply to other databases like CouchDB and the Google Datastore.
1. Data integrity is not guaranteed.
Data stores like SimpleDB don’t support the same rigorous constraints that RDBMSes do. Some of these databases support single-row constraints, like requiring data in certain fields, but it is nearly impossible for these systems to enforce UNIQUE constraints and foreign keys.
Programmers can work around this by issuing extra queries to confirm an update is valid, but this requires a lot of extra work. This will never be perfectly accurate–it may be impossible to avoid race conditions when two clients simultaneously attempt conflicting updates. And it’s especially difficult with SimpleDB because SimpleDB doesn’t guarantee that a client sees all the recent updates to the data.
2. Inconsistency will provide a terrible user experience.
Speaking of inconsistency, it’s critical to shield users from
this property of SimpleDB.
SimpleDB is optimized for fast writes. Your API calls return as soon as the data is written to the SimpleDB service, but before it’s replicated across all of the SimpleDB servers. If you issue any queries before the data is propagated, you won’t necessarily see your most recent change.
When I save my changes in your web application, I expect that your system will show me a consistent view of those changes. If you show me the data that’s in SimpleDB, my changes might not appear, and I’ll probably get confused. In fact, I will probably freak out, thinking that you lost my data. You can try to inform me about how this works (“It will take a few minutes for your changes to be visible…”) but that’s not easy for users to grasp.
3. Aggregate operations will require more coding.
SimpleDB does not support aggregate operations like joins, GROUP BY, SUM/AVERAGE functions, and sorting. You will need to implement these yourself.
Todd Hoff argues that this “suckiness” is a fair tradeoff:
SimpleDB shifts work out of the database and onto programmers which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I’ll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don’t even care how twisted and inelegant the code is because we can make it work. And as long as we can make it work we are happy.
I disagree. More boilerplate code distracts you from actually solving real users’ needs. Why reinvent the GROUP BY wheel when MySQL, PostgreSQL and Oracle have already perfected it?
4. Complicated reports, and ad hoc queries, will require a lot more coding.
In my experience, database use falls into three broad patterns: (1) standard queries and updates performed by your application’s users; (2) more complicated reports for users and internal staff; and (3) ad hoc queries for troubleshooting and system monitoring. SimpleDB may be optimized for category 1, but categories 2 and 3 will be much more difficult without SQL.
Complicated reports are probably the best application of the SQL language. Because SQL is a declarative language, it’s incredibly easy to generate aggregate information about your data. In my previous jobs, our reports often required hundreds of lines of SQL to get the right information out of the database. This is a lot of code, but it was required to generate the data for our customers. Without access to SQL, your programmers will need to implement reports through imperative statements, which will exponentially increase the development time.
Ad hoc queries are even worse: they’re usually simpler, but they’re always changing. An RDBMS expert can often write an ad hoc SQL query as fast as the marketing department can explain what they need. Using an imperative programming language to write these queries would destroy your developers’ productivity.
5. Aggregate operations will be much slower if you don’t use an RDBMS.
RDBMSes are highly optimized for performing aggregate operations across huge volumes of data. Fast algorithms like the hash join, merge join, and indexed binary search have been around for 20 years or more. SimpleDB and the Google Datastore return datasets which are more like objects than traditional database rows. It’s unlikely that you’ll be able to process this data with anything other than nested loops, especially if your programmers aren’t database algorithm experts. Nested loop algorithms are considerably slower than the others.
Even if you’re the 31337est database expert and enjoy writing these operations in your business objects, there’s another performance factor to consider. In order for your application server to handle aggregate operations, you will need a copy of all the relevant data on the application server. Rather than downloading a single SUM function result from the database, your application server will need to download all the data required to calculate the sum. This extra data transfer will add considerable latency when you’re dealing with thousands or millions of records.
6. Data import, export, and backup will be slow and difficult.
Oracle, MySQL and other RDBMSes include advanced tools to perform large-scale data import and export operations. These tools have also been refined for 20 years or so, and can process millions of rows per minute. There are no such tools for key-value data stores, because these products are so new.
When you’re processing millions of records, network latency makes a big impact. Most of these services perform a remote procedure call for each record inserted; some even limit you to querying one record per remote call. On the Internet, round-trip latency is usually 20-40ms, which may slow you down to fewer than 2,000 rows per minute. (You can process more quickly via multi-threading, but again, that requires you to write a lot more infrastructure code.)
7. SimpleDB isn’t that fast.
Todd Hoff’s article referenced a SimpleDB performance test which found that 10 record IDs could be retrieved in 141ms from a 1,000-record table; in 266ms from a 100,000-record table; and in 433ms from a 1,000,000-record table.
Compared to relational databases, this is pretty slow.
If you want your web application to be responsive, you need your database queries to operate much faster than this. 20ms responses would be more in line with conventional databases. If you perform 3 SimpleDB queries in series, your web app will take about 1.5 seconds for that operation, and users will notice when the app is that slow. Many web applications actually make dozens of queries per request.
Further, tables with a million records aren’t large enough to need significant scalability. A million-record table is probably small enough to fit entirely in RAM; surely its indexes could fit in RAM. The real test of SimpleDB scalability is its performance on a table with 100 million or 1 billion records.
8. Relational databases are scalable, even with massive data sets.
The world’s largest companies all use giant relational databases, and they’ve been able to make this work. The world’s largest websites use relational databases, and they’ve also been able to scale successfully. Facebook and LiveJournal use MySQL; MySpace uses Microsoft SQL Server; Salesforce.com uses Oracle. When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.
We all expect Oracle to scale if we pay them enough money, but even free databases have made significant advances to prevent the database server from becoming a bottleneck. The first line of defense is caching–eliminating repetitive queries can offload massive amount of processing. Beyond caching, there are free clustering engines which let you balance your database requests around a few servers in a cluster.
Without a complicated clustering setup, your data can usually be partitioned across multiple servers to eliminate the single-server bottleneck. Lest you think I’m ragging on Todd Hoff, he’s written a nice overview of sharding, one way of designing a federated database to get around the bottleneck.
9. Super-scalability is overrated. Slowing the pace of your product development is even worse.
Time-to-market is a critical factor for most software products. If you’re writing internal software for a business, budgetary concerns are equally critical. You can workaround most of the drawbacks I’ve identified above, but it will cost you time and money.
More importantly, all these technical workarounds distract you from addressing the real needs of your customers. If you don’t focus on making something people want, it doesn’t matter how scalable your database is, because you won’t have any customers to fill up the database.
The hype around the new data stores seems to be a case of premature optimization, yet we all know Donald Knuth’s famous quote, “Premature optimization is the root of all evil.” Why not wait and address super-scalability once you’ve created a super product and have generated super cash flow?
10. SimpleDB is useful, but only in certain contexts.
Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers. I don’t think it was ever intended for that purpose. SimpleDB’s architecture is similar to Dynamo, Amazon’s internal “highly-available key-value store.” One of its main distinguishing features is the flexible schema: the ability to add custom fields to individual records, and to store multiple values in each field.
If you’re working with “semi-structured” data, then this is actually incredibly useful. For example, it’s an awesome way to persist web application sessions. You can avoid the overhead of marshaling the object-oriented session data into columns and rows, and many of the drawbacks above don’t apply because you don’t generally query sessions like you query more typical relational data.
Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’t bad products. But we do them a disservice when we construe them to be replacements for general-purpose databases. Used carefully, they can help your organization. But used indiscriminately, you’ll create a lot more work for your programmers and you’ll make your application perform even worse.
