Caution! Database Sharding is like the anti-consolidation of databases. It is splitting the database into many small databases. You spend years and years on trying to unify and gather together databases and all of a sudden you are told that there is an application managed scaling technique using hundreds of independent databases. Tricky, right?
Sometimes, when planning database solutions in terms of scalability and massiveness, going beyond a certain point might be risky. This is the case when database shards may be of huge help (big website used globally). The word shard may sometimes refer to a piece of glass, a sea glass that can be found almost everywhere, for example at the beaches near San Francisco.
This is my friend Pasi Jaakkola. I took the picture in 2006 during Oracle Open World 2006. We brought however no databases, no glass shards from Frisco, not literally at least. But we both often dare “cross the point” and take the extra mile.
Here are 3 sources on database sharding:
1. Wei Hu’s excellent white paper Global Scale Web 2.0 Data Platforms Sharding with Oracle Database
2. Tom Kyte’s discussion on Shared Nothing
3. The Rise of Database Sharding on CodeFutures.
Let us have a look at number 3 and the 3 claims there about the several advantages of smaller databases:
Claim 1: Smaller databases are easier to manage
Claim 2: Smaller databases are faster
Claim 3: Database Sharding can reduce costs
Having a quick glance at the three statements above most people would say yes, true. But people who had been involved in the DBA work on data center level (meaning working with 100s of databases) might tend to disagree. Why? Here we go:
“Smaller databases are easier to manage”: if you have to break a databases into smaller databases, then you will have more monitoring configurations for Grid Control, thus the number of possible problems will increase. You will have to configure more backups and more problems on the way. Naturally. If you have to recover the data one day probably consistency of the data will be a rather important thing to consider, so you might have to recover not just one database to the timestamp needed. So what is easier to manage for a smaller database? The claim “Routine table and index optimizations can stretch to hours or days, in some cases making regular maintenance infeasible” is unclear to me. What is table optimization? Gathering statistics for the table? Online reorganization? Beats me 🙂
Next, “Smaller databases are faster”: yeah right. I have seen databases less than 1G in size having SQL statements running longer than the complete trilogy of the Godfather. I don’t even want to get into that. If a database is fast or not has nothing to do with the size. Check this claim out: “By hosting each shard database on its own server, the ratio between memory and data on disk is greatly improved, thereby reducing disk I/O.” Great! Nobody cares about the extra computers, the software licenses.. Let’s have a look at the other statement: “This results in less contention for resources, greater join performance, faster index searches, and fewer database locks.” So the smaller the database, the less locks you have? At least this is what I understand from the quotes sentence. And we all know that locks are due to clumsy SQLs and logical structure of the database, not due to the growing size of the database.
And now the most bizarre claim: “Database Sharding can reduce costs”. Right, cost! That is what all we want to hear. Byte the hook! Can but cannot. It can rather increase the cost. More hardware, more license fees, more databases to manage.
The real benefits of Database Sharding can be found in the Oracle’s white paper above. It is written by Wei Hu and it is exceptionally good! And good results with Sharding can be achieved with Oracle by the way. Forget about MySQL here. But decide for yourself after you read the white paper.
These are frosted shards of glass:
What is easier: to break the glass into these small pieces or do build the glass out of all these shards?
And the questions asked in the CodeFutures arcticle: “If Database Sharding is highly scalable, less costly, and improves performance, why hasn’t adoption of the technology been more widespread?” Probably, because not every Sharding is highly scalable, less costly, and improves performance. Really.
Conclusion: the real benefit of Database Sharding is often neglected. Main reason being the concept is wrongly comprehended. As Wei Hu says: “Sharding across many databases multiplies Oracle’s advantages a-thousand-fold”!