Fundamentals of system design Chapter 7: Understanding Database Sharding

Posted May 25, 2023 Updated Oct 29, 2024

By Fredrick Kamau

Introduction

A database is a critical part of any system, and its design has a major impact on an application’s performance. For example, consider the early stages of a social media platform like Facebook. Presumably, it used a single database for reading and modifying data, and everything operated smoothly. However, as the platform gained popularity over a few years, the user base grew rapidly. By the spring of 2009, it had about 200 million active users. With that much load on a single database, the application’s response time slows down significantly, leading to a poor user experience. The database becomes a bottleneck, resulting in high latency and low throughput. To address this, the database has to be scaled. Two approaches are database replication and database sharding. For a social media platform where an enormous volume of user-generated data is written every second, sharding is often the better fit, because it spreads writes as well as reads across machines.

Real life analogy

Imagine you have a large library with a massive collection of books, ranging from fiction, poetry and novels to science, children’s literature and so forth. The library is very popular and attracts thousands of readers a day. As a librarian, you find it a nightmare to manage and retrieve books efficiently, especially if they are not stored in a particular order. To solve this problem, you decide to separate the library into multiple smaller libraries. Each library is an independent entity with its own collection of books, staff and resources. Think of the small libraries as shards.

You divide the books among these smaller libraries based on some predefined rule. For instance, you might decide to shard the books based on the first letter of the author’s last name. All books by authors whose last names start with A-E go to Library 1, F-J to Library 2, and so on. Each library only manages a subset of books, avoiding the burden of handling the entire collection.

Visitors are now directed to the appropriate library based on the book they are looking for. If a visitor wants a book written by an author whose last name starts with C, they are guided to Library 1.

Each library operates independently and manages its own collection. The libraries can handle their visitors’ requests, maintain their shelves, and perform administrative tasks without depending on other libraries. If one library experiences issues or requires maintenance, it doesn’t affect the operations of other libraries.

As the number of books and visitors keeps growing, you can continue to create more libraries and distribute the books accordingly.

In the context of databases, sharding works similarly. The database is divided into multiple shards, each hosted on a separate server or node. Data is distributed among these shards based on a predefined rule or shard key. Each shard manages its own subset of data, and queries or operations are directed to the relevant shard.

What is database sharding?

Database sharding is the process of storing a large database across multiple machines. It works by splitting the rows of a table into multiple smaller groups of rows known as partitions. Each partition has the same schema and columns, but holds its own unique rows, independent of the other partitions. This is an example of horizontal database scaling.

Original User Table

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Horizontal Partition 1

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe

Horizontal Partition 2

UserID	FirstName	LastName	Gender	Region
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Sharding involves splitting data into two or more chunks, called shards, which are then distributed across multiple database nodes. Together, the shards hold the entire dataset. In most cases, sharding is implemented at the application level: the application contains the logic that decides which shard to read from or write to. In the example above, Facebook could split the user table by region, which should improve reads and writes significantly. If a request comes from Africa, the application forwards it to the shard that holds the Africa data instead of searching a single database with billions of user records. Because no single database carries the whole load, latency drops and throughput rises. Performance can be improved further by hosting each shard’s database node close to the users in its region.
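As a rough illustration of application-level routing, the sketch below maps a request’s region to a shard. The shard names, the `REGION_TO_SHARD` mapping and the `route_request` helper are hypothetical, chosen only to mirror the user table above. A real application would hold connection details for each shard rather than plain strings.

```python
# Hypothetical mapping from a user's region to the shard that stores it.
# The shard names are placeholders, not real hosts.
REGION_TO_SHARD = {
    "Africa": "shard-africa",
    "Europe": "shard-europe",
    "North America": "shard-north-america",
    "Asia": "shard-asia",
}


def route_request(region: str) -> str:
    """Return the name of the shard that should serve this region."""
    try:
        return REGION_TO_SHARD[region]
    except KeyError:
        raise ValueError(f"no shard configured for region {region!r}") from None


print(route_request("Africa"))  # shard-africa
```

The point of the sketch is that the routing decision happens before any query runs, so a request for one region never touches the other shards.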

Sharding Architectures

Once you have decided to shard your database, the next question is how to go about it. When distributing workload across shards, it’s crucial that each request goes to the intended shard. Below are a few common sharding architectures:

Range Based / Dynamic Sharding - Range-based sharding splits the rows of a table based on ranges of a value. For example, you might look at the first letter of each user’s name and store the row in the shard assigned to that range of letters.

Name	Shard Key
Starts with A to I	A
Starts with J to S	B
Starts with T to Z	C

The application maps the shard key to a physical node and stores the row on that machine. Range-based sharding is easy to implement; however, it can overload a single physical node. In our example, shard A, which holds names starting with A to I, might contain many more rows than shard C.

Hashed Sharding - Hashed sharding uses a hash function (a mathematical formula) to assign a shard key to each row of the database. The hash function takes information from the row and produces a hash value, which is used as the shard key.

Name	Shard Value
Bob	1
Alice	2
Chad	1
Diana	2

Hash functions work best when you want to distribute data evenly among the physical shards. However, hashing does not group rows according to any meaningful information.

Directory Sharding - Directory sharding uses a lookup table to match database information to a shard. The lookup table contains a static set of information about where specific data can be found.

Company Name	Shard Key
Company A	A
Company B	B
Company C	C

Each shard is a meaningful representation of the database and is not limited by ranges. However, directory sharding fails if the lookup table contains wrong information.

Geo Sharding - Geo sharding splits and stores information according to the user’s geographical location. In the Facebook example, user information can be stored by region, for example Africa, Europe, Middle East, Asia, North America and so forth. Geo sharding greatly improves the user experience because the shorter distance between user and server lets requests be answered faster. However, it can also result in an uneven distribution of data.

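The range, hashed and directory schemes above can each be sketched as a small routing function. This is a minimal illustration, not a production design: the function names (`range_shard`, `hash_shard`, `directory_shard`) are hypothetical, and the hashed version uses SHA-256 modulo the shard count as one possible choice of hash function. The shard values in the hashed table above are illustrative, so the numbers this sketch produces for the same names may differ.

```python
import hashlib

# Range-based: letter ranges taken from the range table above.
RANGES = [("A", "I", "A"), ("J", "S", "B"), ("T", "Z", "C")]


def range_shard(name: str) -> str:
    """Pick a shard from the first letter of the name."""
    if not name:
        raise ValueError("name must not be empty")
    first = name[0].upper()
    for low, high, shard in RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no range covers {name!r}")


def hash_shard(key: str, num_shards: int = 2) -> int:
    """Pick a shard (numbered from 1) from a stable hash of the key.

    hashlib is used instead of the built-in hash() because Python
    randomizes string hashes per process, which would send the same
    key to different shards after a restart.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards + 1


# Directory: an explicit lookup table, as in the company table above.
DIRECTORY = {"Company A": "A", "Company B": "B", "Company C": "C"}


def directory_shard(company: str) -> str:
    """Look the shard up; an unknown key means the directory is incomplete."""
    if company not in DIRECTORY:
        raise KeyError(f"{company!r} is missing from the lookup table")
    return DIRECTORY[company]


print(range_shard("Alice"))          # A
print(hash_shard("Alice"))           # 1 or 2, the same on every run
print(directory_shard("Company B"))  # B
```

Geo sharding follows the same pattern as the directory version, with the user’s region as the lookup key.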

Benefits of database sharding

Improved Performance and Scalability - Sharding enables horizontal scaling, allowing the database to handle larger amounts of data and higher workloads. By distributing data among multiple shards, each shard can handle a subset of the overall data and workload. This parallel processing capability improves the system’s overall performance and scalability.
Enhanced Availability and Fault Tolerance - Sharding provides better fault isolation and availability compared to a single monolithic database. If one shard or database system fails, the other shards remain operational, ensuring that the system can continue to function. Sharding also allows for distributed data replication and backup strategies, further enhancing fault tolerance and data availability.
Geographical Distribution and Localized Data - Sharding facilitates the distribution of data across different geographical regions. This can be beneficial for applications with a global user base, as it enables data to be stored closer to the users, reducing latency and improving the user experience. Sharding also allows for compliance with data sovereignty regulations, as data can be stored in specific regions based on user location.

Drawbacks of database sharding

Increased Complexity - Sharding involves distributing and coordinating data across multiple shards, which can be challenging to set up and maintain.
Data Skew and Hotspots (celebrity problem) - Some shards may hold more data or receive more traffic than others, which remain underutilized. In the Facebook example, one shard might happen to hold the data of more celebrities than the rest. That shard will receive more traffic than the others as users request the celebrities’ profiles. You will have to monitor and rebalance data across the shards to keep the workload evenly distributed.
Schema Changes - Making schema changes or altering data structures across multiple shards can be more challenging than in a single-database system.
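The data skew described above can be made visible by counting how many keys land on each shard. The sketch below reuses the first-letter ranges from the range-based example. The sample names and the `skew_ratio` helper are made up for illustration, and the ratio (busiest shard divided by the average shard) is just one simple way to express imbalance.

```python
from collections import Counter


def first_letter_shard(name: str) -> str:
    """Range-based rule: A-I -> shard A, J-S -> shard B, everything else -> shard C."""
    first = name[0].upper()
    if "A" <= first <= "I":
        return "A"
    if "J" <= first <= "S":
        return "B"
    return "C"


def shard_load(names: list[str]) -> Counter:
    """Count how many rows each shard would hold."""
    return Counter(first_letter_shard(name) for name in names)


def skew_ratio(load: Counter, num_shards: int) -> float:
    """Rows on the busiest shard divided by the average rows per shard."""
    average = sum(load.values()) / num_shards
    return max(load.values()) / average


# Hypothetical sample data.
names = ["Alice", "Bob", "Chad", "Diana", "Eve", "Frank", "John", "Mark", "Tom"]
load = shard_load(names)
print(dict(load))           # {'A': 6, 'B': 2, 'C': 1}
print(skew_ratio(load, 3))  # 2.0
```

A ratio close to 1 means the shards are balanced. Here shard A holds twice the average, which is the kind of signal that would prompt rebalancing.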

Alternatives to database sharding

Sharding is a horizontal scaling strategy that allocates additional nodes to share the user traffic. The major advantages of scaling out are that you are not limited by the hardware of a single machine and that the architecture is fault tolerant: when one node fails, the others continue to operate, so a failure affects only the data on that node rather than the whole application. However, sharding is only one method of database scaling. You can explore other alternatives such as:

Vertical Scaling (Scaling Up) - Vertical scaling increases the computing power of the database server, for example by adding a better CPU, more RAM and more storage to handle increasing traffic. Vertical scaling is less costly, but there is a limit to the hardware resources you can upgrade.
Database Replication - This technique makes exact copies of the database and stores them across different nodes. Unlike replication, sharding does not create exact copies of the information. It splits a database into multiple parts and stores them on multiple computers. Sharding can be used together with replication to achieve higher scalability and availability. More details about database replication can be found here.
Database Partitioning - Partitioning is the process of splitting a database table into multiple groups. Partitioning can be classified into two types:
- Horizontal Partitioning (Sharding) - The table is split by rows.
- Vertical Partitioning - The table is split by columns.

Original User Table

UserID	FirstName	LastName	Gender	Region
1	John	Otieno	Male	Africa
2	Bob	Henry	Male	Europe
3	Mark	Biden	Male	North America
4	Alice	Lee	Female	Asia

Vertical partition 1

UserID	FirstName	LastName	Gender
1	John	Otieno	Male
2	Bob	Henry	Male
3	Mark	Biden	Male
4	Alice	Lee	Female

Vertical partition 2

UserID	Region
1	Africa
2	Europe
3	North America
4	Asia

Partitioning stores all the data groups on the same computer, whereas database sharding spreads them across different computers.
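The difference between the two kinds of partitioning can be sketched on the user table from this article. The table is held as a plain list of dictionaries, and the `horizontal_partition` and `vertical_partition` helpers are hypothetical names used only for this illustration. A real database would do this with its own partitioning features.

```python
users = [
    {"UserID": 1, "FirstName": "John", "LastName": "Otieno", "Gender": "Male", "Region": "Africa"},
    {"UserID": 2, "FirstName": "Bob", "LastName": "Henry", "Gender": "Male", "Region": "Europe"},
    {"UserID": 3, "FirstName": "Mark", "LastName": "Biden", "Gender": "Male", "Region": "North America"},
    {"UserID": 4, "FirstName": "Alice", "LastName": "Lee", "Gender": "Female", "Region": "Asia"},
]


def horizontal_partition(rows, predicate):
    """Split by rows: each row keeps every column and goes to one of two groups."""
    matching = [row for row in rows if predicate(row)]
    rest = [row for row in rows if not predicate(row)]
    return matching, rest


def vertical_partition(rows, columns, key="UserID"):
    """Split by columns: every row is kept, but only the key and the chosen columns."""
    return [{col: row[col] for col in (key, *columns)} for row in rows]


# Horizontal: users 1-2 in one partition, users 3-4 in the other.
part1, part2 = horizontal_partition(users, lambda row: row["UserID"] <= 2)

# Vertical: profile columns in one partition, region in the other.
profiles = vertical_partition(users, ["FirstName", "LastName", "Gender"])
regions = vertical_partition(users, ["Region"])

print([row["UserID"] for row in part1])  # [1, 2]
print(regions[0])                        # {'UserID': 1, 'Region': 'Africa'}
```

Note that both vertical partitions repeat the `UserID` column, since that key is what lets the pieces of a row be joined back together.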

Should I Shard My Databases?

When your application grows in size, there is a need to scale. A sharded database architecture can solve your problems, but some see it as a headache that should be avoided unless it’s absolutely necessary. Below are some scenarios where it may be beneficial to shard a database:

Enormous Amount of Data - When you are dealing with extremely large volumes of data, it’s advisable to split the data into multiple shards to remove performance bottlenecks.
Isolation of Different Customer Data - In a multi-tenant application such as Office 365, where data from different organizations needs to be isolated, sharding the database is an effective method. Each shard can be dedicated to a specific organization, providing data separation, privacy and security.
Geographic Distribution - When an application or service has a global user base, sharding can be useful for distributing data closer to users in different regions. Each shard can be located in a specific geographical region, reducing data access latency and improving the user experience.

Conclusion

It’s important to note that the decision to shard a database should be based on careful analysis of the specific requirements, data patterns, and expected growth. Implementing database sharding requires thorough planning, design, and ongoing management to ensure the desired benefits are achieved effectively.

Thank You!

I’d love to keep in touch! Feel free to follow me on Twitter at @codewithfed. If you enjoyed this article please consider sponsoring this blog for more. Your feedback is welcome anytime. Thanks again!

System Design

This post is licensed under CC BY 4.0 by the author.
