What you must know about DB Sharding?
Cloud & Backend EP 7 - Actionable tips on top sharding strategies
Is your database server stretched to its limits?
Are database CPU spikes and late-night support calls becoming more and more common?
Are you looking at ways to scale your database?
If any of this is true, database sharding can be a potential solution to your problems.
Of course, it may not be the only solution. But as a backend developer, you must consider and compare it with the other things on the table.
Hello 👋 and welcome to a new edition of the Cloud & Backend newsletter.
In this post, I help you understand the various aspects of database sharding, the strategies of sharding, and actionable tips you must take into account before deciding to go for sharding.
Here’s the agenda:
🤖 What is Sharding and why do we need it in the first place?
🥊 Logical vs Physical Sharding and how are they used in a practical setup.
💼 Top Sharding Strategies and when you should use a particular strategy.
⚖️ Benefits & Drawbacks of Sharding and when to use sharding versus when to avoid it like the plague?
🧶 Adopting Sharding and things you must ask yourself before deciding to shard your database.
🤖 What is Sharding?
Sharding is a database architecture pattern.
It evolves out of horizontal partitioning in which you separate the rows of one table into multiple different tables, known as partitions.
Here’s an illustration that shows how horizontal partitioning works in practice.
And what better way to demonstrate this than by partitioning Star Wars character intros based on their era? 🤓
But what’s the point of partitioning data? 🤔
Just like many database strategies, partitioning also aims to reduce the effort of querying data.
The less number of records a query has to run over, the more performant it will be.
As you can see in the above diagram, partitioning helps you reduce the number of records a query has to run over.
🥊 Logical vs Physical Sharding
When you start with sharding, you typically start with logical shards (also known as partitions).
Most databases have out-of-the-box support for partitioning.
👉 Logical shards are horizontal partitions of your data based on some strategy. They sit on the same database instance or server.
However, when a single database instance is not able to handle the workload even after partitioning, you need to go for physical sharding.
👉 In physical sharding, each shard is hosted on a different node or server instance.
However, a physical shard can contain one or more logical shards.
Here’s what it may appear like in practice.
Does it look complicated?
Yes, perhaps.
But why you might need to go for something like this? 🤔
Consider a real-life practical example of a large e-commerce platform that operates globally.
🚀 What are the goals of such a platform?
The platform must store and manage a large amount of data related to product listings, customer orders, and inventory across multiple geographic regions.
The database must be highly available, scalable, and performant enough to handle the high request and transaction volumes. Remember - these requests occur across different regions.
🧑🚀 How to achieve the goals?
You can use physical sharding to divide the data into multiple shards. Different shards handle different geographic regions making the data distributed.
Within each physical shard, you can use logical sharding to group related data together. Think of data such as orders for specific product categories.
This can help improve query performance by minimizing the amount of data that must be accessed for each request. Also, the write or update operations are distributed across multiple servers.
Remember - the goal of sharding is to help your application deal with fewer data for a given request.
Let’s now look at the most popular sharding strategies.
💼 Sharding Strategies
There are multiple sharding strategies that you can adopt depending on your requirement.
👉 Key-Based Sharding
In this strategy, you partition the data based on a hash function applied to one or more columns in the table. That’s why this type of sharding is also known as hash-based sharding.
Usually, you apply the function to the primary key column but you can also use a combination of columns.
The result of the hash function determines which shard the data will be stored in.
A very simple hash function might look as below:
function shardNumber(num, numShards) {
const shardSize = Math.ceil(num / numShards);
return Math.floor(num / shardSize);
}
Here’s what key-based sharding looks like in practice.
Key-based sharding is great for tables where the data is not naturally partitioned and you want to evenly distribute the data across multiple shards.
👉 Range-Based Sharding
In this strategy, you partition the data based on a specific range of values in a particular column.
The below illustration makes it easier to understand.
Here, you have a product table that contains a big list of products with their prices. You can create different shards based on the price range they fall into.
For example, shard 1 contains all products within the price band of $0-$75. Shard 2 contains all products within the price band of $76-$150.
The main benefit of range-based sharding is the ease of implementation. However, it’s far from perfect and we will look at the downsides in the last section of this post.
👉 Directory-Based Sharding
In this strategy, you perform sharding based on a lookup table.
Think of this table as a directory (similar to an address book) that holds the relation between the data and the specific shard where you can find it. Hence, the name directory-based sharding.
Again, an illustration can explain things much more efficiently.
In the above example, the Location field acts like a shard key.
Data from the shard key is written to a lookup table that maps the key to a particular shard. For example, data for the USA location is stored in shard 1, and so on.
The main benefit of directory-based sharding is higher flexibility when compared to the other strategies.
⚖️ Benefits and Drawbacks of Sharding
Sharding is usually not the first choice to achieve database scalability.
In fact, it has its own share of benefits and drawbacks you must look into before betting your money on sharding.
Here are the main benefits of sharding:
Sharding helps you achieve horizontal scaling or scaling out.
With sharding, you can reduce query response times since the queries run on a smaller set of data.
Sharding also helps make an application more reliable. Even if a shard goes down, the entire application does not stop functioning.
There are also several drawbacks of sharding:
Sharding makes your application quite complex. Instead of one place, the data is scattered across multiple servers.
Even after you have implemented a great sharding strategy, the shards can get unbalanced over time resulting in hotspots.
With sharding, you can practically say goodbye to joining data across multiple shards or performing transactions.
🧶 Before you adopt Sharding
Time to look at some practical stuff to consider before you decide to leverage sharding.
👉 When to go for sharding?
At the bare minimum, you should not go for premature sharding.
Sharding is a complex business and it should never be the first option on the table when your application faces problems with database scalability.
You should look at sharding only when you have exhausted other options like proper indexing, replication, and partitioning.
👉 How to design an appropriate hash for sharding?
Key-based sharding is typically the most popular sharding strategy. However, the success of this strategy depends on the hash function.
Few things to keep in mind while designing a hash function for sharding:
Try to make the hash function fast and efficient.
The hash function should result in uniform distribution of keys.
Make the hash function deterministic in nature. In other words, it should always produce the same output for a given input.
👉 Are sharding strategies different if reads are more important than writes?
Yes, sharding strategies can differ depending on whether reads or writes are more important.
🚀 If reads are more important, you should go for a sharding strategy that prioritizes query performance.
One approach to support this is to use a range-based sharding strategy that results in more efficient range queries and results in faster query execution.
🚀 If writes are more important, your sharding strategy should prioritize write throughput and data consistency.
In other words, you can choose a shard key that distributes writes evenly across shards and minimizes the likelihood of hotspots.
A hash-based sharding strategy can help with this.
Another approach can be to use a time-based sharding strategy where data is partitioned based on time intervals such as hours or days. This can help distribute write operations evenly. Discord uses this time-based sharding strategy to handle trillions of messages.
👉 When NOT to use a particular sharding strategy?
Sometimes, elimination is a great way to arrive at an ideal solution.
Since all strategies have potential problems, you can arrive at an ideal strategy by rejecting a particular strategy.
Here’s a quick reference you can use:
🚀 Avoid key-based sharding if you need to add more shards on a frequent basis to rebalance the data. When you add a shard, your hashing function will start to give a different result if it depends on the number of shards. This means you need to make a change across all the clients using the hash function and make corresponding changes.
🚀 Avoid range-based sharding if data within each range can be drastically different. If this happens, you will have an uneven distribution of data resulting in database hotspots.
🚀 Avoid directory-based sharding if you don’t want a single point of failure like a lookup table. The entire sharding logic will go for a toss if something goes wrong with the lookup table.
⚽ Over to you
Have you used sharding in your application?
If yes, which database and which strategy have you employed?
If not, are you keen on doing sharding or do you prefer another approach to scale your databases?
Write your replies or thoughts in the comments section.
And that’s all for today.
If you found today’s post useful, give the post a like and consider sharing it with friends and colleagues.
Wishing you a great week ahead! ☀️
See you later.