Fundamentals of System Design Chapter 9: What is CAP theorem?

July 10, 2023 3 minute read

Introduction

Have you ever come across an advertisement for services like graphic design or house painting that begin with a catchy headline: Cheap, Fast and Good: Pick Two”? Cheap represents the financial aspects of the project. Fast represents the speed at which the tradesperson completes the project and Good signifies the level of craftsmanship and excellence.The tradesperson is simply saying, it’s unreasonable for a client to expect to get all the three. There is a trade off and the client should choose what to forgo:

Fast and good = expensive
Cheap and fast = low quality
Good and cheap = takes time.

What is CAP theorem?

In the tradesperson analogy, you are faced with a trade-off, similar to the CAP theorem. The CAP theorem states that in networked shared-data systems or distributed systems, we can only achieve at most two out of three guarantees for a database: Consistency, Availability and Partition Tolerance. A distributed system is made up of multiple nodes of storing data and can only deliver two of three desired characteristics: Consistency(C), Availability(A) and Partition tolerance(P).

1. Consistency

Consistency mean every read request from clients receives the most recent write. All clients see the same data no matter which node they are connected to. For this to happen, whenever there is a write in a database node, the data should be replicated to all other database nodes. In other words, updates to the system are immediately reflected in all nodes. The data should be consistent.

2. Availability

Availability means that any client making a request should always get a valid response even when some nodes are down or unavailable.

3. Partition Tolerance

A partition is a communication break, which happens when there is a lost or delayed connection between 2 or more nodes in a distributed system. Partition tolerance means that the nodes cluster should continue functioning despite a number of communication breakdowns between nodes.

To make easier to understand, let’s look at a concrete real world example.

Real World Example

Let’s say you have developed a system and it gets quite popular. You decide to scale horizontally by increasing 2 database nodes(n2 and n3) in order to serve clients requests more effectively.

Ideal situation

When data is written on the n1, it is replicated to the other nodes(n2 and n3). Both data consistency and system availability are achieved. This is an example of CA (consistency and availability) systems. A CA system cannot exist in real world applications since network failure is unavoidable. Network partitioning generally has to be tolerated.

Normal situation

In a real world, network partition cannot be avoided, and when it occurs we must choose either consistency(CP) or availability(AP). When n3 goes down and it cannot communicate with n1 and n2, if clients write to n1 and n2, the data cannot be propagated to n3. This means n3 will have stale data when a user makes a request via this node.

A CP (consistency and partition tolerance) system like a bank will choose consistency over availability. They must block all write operations to node n1 and n2, to avoid data inconsistencies among the 3 nodes which will make the system to be unavailable. Bank systems have extremely high consistency requirements, for instance it’s crucial for a bank to display the latest bank balance of a client to avoid over withdrawal. If data inconsistency occurs due to network partition, the system should be unavailable and return an error until data inconsistencies are resolved.

An AP (availability and Partition Tolerance) system will keep accepting client data reads even when the data is stale. n1 and n2 data nodes will keep accepting data writes and the data will be synced once the data partition is resolved. This architecture is popular in systems like social media platforms where users uninterrupted access and interaction with the system is prioritized. Social media users can continue browsing through the content even if the data stale.

Conclusion

Distributed systems makes a group of computer nodes to work together to achieve high computing power and high system availability that was not available in the past. The system has a lower latency, a higher throughput and a near 100% up time. Distributed systems offer many benefits, however, according to the CAP theorem, you can only prioritize two out of the three properties: Consistency, Availability, and Partition tolerance. This simply means, in the event of a network partition, you have to choose whether to prioritize system availability and sacrifice consistency or vice versa. It’s important to note that the CAP theorem does not provide a prescriptive answer but rather highlights the trade-offs that need to be considered in distributed system design. Different systems may make different choices based on their specific requirements and constraints.

Share on

Twitter Facebook LinkedIn

Fredrick Kamau

Fundamentals of System Design Chapter 9: What is CAP theorem?

Introduction

What is CAP theorem?

Real World Example

Ideal situation

Normal situation

Conclusion

Share on

Leave a comment

You may also enjoy

Fundamentals of System Design Chapter 8: Understanding Database Indexing

Fundamentals of system design Chapter 7: Understanding Database Sharding

System Design Chapter 6 - Database Replication

Fundamentals of system design Chapter 5: Databases