12 minute read

Fundamentals of System Design

A beginner’s guide to system design (Part 1)

What is system design?

System design is a major phase of software development. It is the process of defining a system’s architecture, components, modules, interfaces and data so that it meets the specified requirements.

Why should I learn system design?

In the last two decades there has been a lot of advancement in large-scale web applications, and these advancements have redefined how we build software. Popular applications and services that we use every day, like Netflix, YouTube, Facebook, Office 365 and Twitter, are highly scalable distributed systems. They handle billions of requests every day, so they need to be designed to cope with that volume of traffic and data with as few failures as possible, and that’s where system design comes in. System design requires you to think about everything from the infrastructure, hardware and software all the way down to the data and how it’s stored. Learning system design will help you design systems that are resilient, i.e. scalable, available and efficient. Your task as a developer is to understand the basic concepts of system design and when to apply them in real-world software solutions.

System Design Performance Metrics

The metrics below are used to measure the performance of a system.

1. Scalability

Scalability describes a system’s elasticity. It refers to a system’s ability to handle an increase in workload without sacrificing performance, that is, to grow and manage an increased volume of requests (traffic) from users over time. The servers should be powerful enough to handle increased traffic loads so that there is no decline in service quality. A poorly designed system will either hit a bottleneck in the amount of traffic it can handle or see its costs rise disproportionately with even a small increase in traffic. There are two ways of scaling an application:

a) Horizontal Scaling (Scaling out)

Horizontal scaling, also known as scaling out, simply means adding more nodes/machines to your existing pool of hardware resources to cope with new demands. If you are hosting an application on one server and its popularity grows to a level where that server can no longer handle the traffic comfortably, adding a new server may be a solution. This is similar to delegating a workload to several employees instead of one.

Advantages of horizontal scaling

i) Eliminates a single point of failure - Relying on a single node for data and all your operations is risky since you will experience downtime when it fails. Distributing your load among multiple nodes eliminates a single point of failure which makes your system resilient to random hardware failures.

ii) Fewer periods of downtime - Since you are adding more machines to an existing pool, you don’t need to switch off the old ones when scaling, so downtime is almost non-existent.

iii) Increased System Performance - Horizontal scaling provides additional endpoints for client connections, so the load is distributed across all the nodes and system performance increases.

Disadvantages of horizontal scaling

i) Increased Complexity - Multiple servers are harder to maintain than one. You will need to configure load balancing and ensure that the nodes communicate and synchronize effectively.

ii) Expensive - Adding multiple servers is more expensive compared to upgrading existing ones.
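
To make the load balancing mentioned above a little more concrete, here is a minimal sketch of round-robin distribution, one of the simplest ways to spread requests across a pool of nodes. The class name, method names and server names are made up for illustration, and a real load balancer would also track node health.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out nodes in rotation so requests are spread evenly."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._rotation = cycle(self._nodes)

    def add_node(self, node):
        # Scaling out: a new machine joins the pool and the rotation restarts.
        self._nodes.append(node)
        self._rotation = cycle(self._nodes)

    def next_node(self):
        # Pick the node that should serve the next request.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-1", "server-2"])
first_four = [balancer.next_node() for _ in range(4)]

# Traffic has grown, so a third (hypothetical) server is added to the pool.
balancer.add_node("server-3")
after_scaling = [balancer.next_node() for _ in range(3)]

print(first_four)
print(after_scaling)
```

With two servers each one receives every second request; after the third server is added, each one receives every third request, which is the essence of scaling out.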

b) Vertical Scaling (Scaling up)

Vertical scaling, also known as scaling up, means adding resources to your existing server, such as upgrading the CPUs, memory, storage and network speed.

Advantages of vertical scaling

i) Cost Effective - Upgrading the current server is usually cheaper than purchasing and maintaining additional servers. Additionally, you can run virtualization software on the upgraded machine, which gives you some of the benefits of horizontal scaling, although the physical machine itself remains a single point of failure.

ii) Simplicity - Since everything runs on a single machine, there is no load balancing or synchronization between nodes to configure.

Disadvantages of vertical scaling

i) Single point of failure - Having your application reside on a single instance introduces a single point of failure. Software or hardware failures such as malware, corrupted software or physical accidents in the data center can lead to costly downtime.

ii) High Possibility of Downtime - Your system will experience some downtime when upgrading the server unless you have a backup server in place.

iii) Hardware Limitations - There is a limit to how much you can upgrade a machine. Every machine has a ceiling for memory, storage and processing power.

2. Reliability

System reliability is the probability that a system will perform correctly during a specific period of time. A system is reliable when it adequately follows the defined performance specifications and no repair is required during that period. Hardware degrades with time, which affects a system’s reliability. Software reliability, on the other hand, is difficult to measure; responses to client requests could slow down but still be accurate. A reliable system should continue working even when software or hardware components fail. Any failing component should be replaced immediately with a healthy one to ensure the completion of a requested task.

Take, for example, a large online store like Amazon, where one of the primary requirements is that a transaction should never be canceled due to the failure of the node running it. If a user adds an item to a shopping cart and proceeds to payment, the system is expected not to lose the cart even if the server carrying out the transaction fails. A reliable system should be fault tolerant, i.e. it should detect failures and migrate the task to another redundant server for completion. A resilient system should be able to eliminate every single point of failure.
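
The idea of detecting a failure and migrating the task to a redundant server can be sketched roughly as follows. Everything here (`NodeDown`, `run_with_failover` and the two stand-in nodes) is hypothetical and only illustrates the control flow; it is not how any particular store implements it.

```python
class NodeDown(Exception):
    """Raised by a (hypothetical) node that cannot finish a task."""


def run_with_failover(task, nodes):
    """Try the task on each node in turn and return the first successful result."""
    errors = []
    for node in nodes:
        try:
            return node(task)
        except NodeDown as exc:
            # Failure detected: record it and migrate the task to the next node.
            errors.append(str(exc))
    raise RuntimeError(f"all nodes failed: {errors}")


def failing_node(task):
    # Stand-in for a server that crashes while handling the transaction.
    raise NodeDown("primary is unreachable")


def healthy_node(task):
    # Stand-in for the redundant server that completes the work.
    return f"completed {task}"


result = run_with_failover("checkout for cart 42", [failing_node, healthy_node])
print(result)
```

The caller never sees the failure of the first node; it only sees the result produced by the redundant one.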

A common way to measure reliability is Mean Time Between Failures (MTBF). MTBF is the average time between system breakdowns. It is calculated by taking the total time a system is running (uptime) and dividing it by the number of failures. For instance, if a system is in service for 100 hours, during which it breaks down twice, once for 3 hours and once for 4 hours, the MTBF can be calculated as follows:

MTBF = (100hrs - 7hrs)/2 breakdowns = 93 hours/2 breakdowns = 46.5 hours
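
The same calculation can be written as a small helper. The function name `mtbf` and its parameters are illustrative choices, not a standard API.

```python
def mtbf(total_hours, downtime_hours):
    """Mean time between failures: uptime divided by the number of failures.

    total_hours    -- length of the observation period
    downtime_hours -- one entry per breakdown, giving how long it lasted
    """
    if not downtime_hours:
        raise ValueError("MTBF is undefined when there were no failures")
    uptime = total_hours - sum(downtime_hours)
    return uptime / len(downtime_hours)


# The example above: 100 hours in service, two breakdowns of 3 and 4 hours.
example = mtbf(100, [3, 4])
print(example)  # 46.5
```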

3. Availability

System availability is the probability that a system works properly when it is requested for use. It is expressed as the percentage of scheduled uptime during which the system is actually available, that is, not down because of problems or other unscheduled interruptions. In other words, it measures whether a system has not failed and is not undergoing repair when it needs to be used.

Availability Calculation

Let’s say that your system is expected to run 24 hours a day, and on one day it had one hour of unplanned downtime because of a breakdown. The system availability for that day can be calculated as follows:

availability % = (available time / total time) * 100
availability % = (23 hours / 24 hours) * 100
= 95.83%

The system availability was 95.83%. This might seem like a high score, but in software 95.83% availability is not good: one hour of downtime every day adds up to roughly 15 days of downtime per year.
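
As a rough sketch, the availability formula and the yearly downtime implied by a given availability level can be computed as follows. The function names and the `HOURS_PER_YEAR` constant (which assumes a non-leap year) are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def availability_percent(available_hours, total_hours):
    """availability % = (available time / total time) * 100"""
    return available_hours / total_hours * 100


def downtime_hours_per_year(availability_pct):
    """Hours per year during which the system is allowed to be down."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


print(f"{availability_percent(23, 24):.2f}%")

for pct in (90, 95, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available -> {hours:.2f} hours of downtime per year")
```

At 90% availability the allowance is about 876 hours (36.5 days) per year, while at 99.999% it shrinks to a little over 5 minutes.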

System availability of 90% equates to 36.5 days of downtime per year. An online marketplace like Amazon could lose billions of dollars every year even with an availability of 95%. Cloud computing providers like Azure, AWS and Google Cloud publish Service Level Agreements (SLAs) that commit them to defined standards of reliability and availability, so that your systems keep running smoothly despite disruptions. “Five nines” means an availability of 99.999%, which allows only about 5 minutes of downtime per year and is a common SLA target between companies.

It can be a challenge to measure system availability when using a microservices architecture, since some components might be less available than others. This can be mitigated by having redundant (backup) or replicated servers in case one fails. A load balancer can detect when a server fails and route traffic to a backup server instead, which in turn increases availability.
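
To see why redundancy raises availability, consider a simplified model in which each replica fails independently of the others (an assumption that real deployments only approximate). The service is down only when every replica is down at once, so the combined availability is 1 - (1 - a)^n. The function name below is illustrative.

```python
def combined_availability(single_availability, replicas):
    """Availability of a group of redundant servers.

    Assumes failures are independent and that any one healthy
    replica is enough to serve traffic.
    """
    probability_all_down = (1 - single_availability) ** replicas
    return 1 - probability_all_down


for replicas in (1, 2, 3):
    combined = combined_availability(0.95, replicas)
    print(f"{replicas} replica(s) at 95% each -> {combined:.4%} combined")
```

Under these assumptions, a single server that is 95% available becomes roughly 99.75% available with one backup and about 99.99% with two.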

Availability Vs Reliability

What is the difference between reliability and availability?

Availability is a measure of the percentage of time the system is in an operable state, while reliability is a measure of how long the system performs its intended function without breaking down. The two go hand in hand: an increase in reliability usually translates to an increase in availability. It’s important to keep in mind, however, that the two metrics can produce different results. You might have a highly available machine that is not reliable. Take for example a commercial blender that is operating close to its maximum capacity. The motor can run for several hours a day, which implies high availability. However, it may need to stop and cool down every half hour to resolve operational problems. Despite its high availability, the blender is not a highly reliable piece of equipment.

Best Practices to improve system availability and reliability

The goal of high availability is to minimize system downtime and/or minimize the time needed to recover from an outage. This can be achieved by:

i) Build with failure in mind - Always plan for your application and services failing. As Werner Vogels, the CTO of Amazon, says, “Everything fails all the time”. Design constructs such as simple try/catch blocks, retry logic and circuit breakers allow you to catch errors and limit the scope of a problem, so that your app continues working even if parts of it are failing. The circuit breaker pattern is especially useful for handling dependency failures, since it can greatly reduce the impact a failing dependency has on your system.

ii) Always think about scaling - An application that generates a certain amount of traffic today might generate a lot more traffic sooner than you anticipate. As you build your app, don’t build it for today’s traffic, but for tomorrow’s. This can be achieved by building the application in a way that lets you add servers and increase the size and capacity of your databases easily when needed.

iii) Reduce single points of failure - Eliminate all single points of failure from your application infrastructure. Since all hardware fails at some point, minimize the impact that a failure will have on your application. This means having backups of everything: servers, routers, switches, power sources, etc.

iv) Monitor the application - Make sure your application is instrumented so that you can see how it is performing. Instrumentation tools monitor the health of servers and the performance of applications and services, run synthetic tests (which examine in real time how the app is working from the user’s perspective) and alert the appropriate personnel when problems occur so that they can be resolved quickly.

v) Respond to downtime in a predictable way - Monitoring is useless unless you are prepared to act on the issues that arise. You should establish processes that your team follows to diagnose and fix common failure scenarios. These standard processes should be prepared ahead of time so that during an outage the owner of the affected service is alerted and can restore the service quickly.
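
The retry logic and circuit breaker mentioned in the first practice can be sketched as below. This is a deliberately small illustration rather than a production implementation: the class and function names, the thresholds and the `flaky_dependency` stand-in are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency is marked unhealthy")
            # Timeout elapsed: let one trial call through (half-open state).
            self._opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # a success closes the circuit again
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying is pointless while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def flaky_dependency():
    # Stand-in for a downstream service that keeps timing out.
    raise ConnectionError("dependency timed out")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_dependency)
    except CircuitOpenError:
        outcomes.append("rejected fast")
    except ConnectionError:
        outcomes.append("failed")

print(outcomes)
```

After two consecutive failures the breaker opens, so the third call is rejected immediately instead of waiting on a dependency that is known to be unhealthy. That is how the pattern limits the impact of a failing dependency on the rest of the system.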

4. Efficiency

System efficiency measures how well a system works. The two metrics used to measure system efficiency are Latency and Throughput.

i) Throughput

Throughput refers to how much data can be processed within a specific period of time. It’s a measure of the quantity of data being sent or received per unit of time, commonly expressed in megabits per second (Mb/s); for example, a system might process 1 TB of data per hour. In a client-server system, client throughput is the number of responses per unit of time a client can get for the requests it makes. Server throughput measures how many requests per unit of time (usually per second) a server can process.

ii) Latency

Latency is a measure of delay, usually expressed in milliseconds. In a client-server system, there are two types of latency:

  • Network latency - It’s the amount of time it takes for data/packets to travel from a client to the server. The time can be measured one way or as a round trip.
  • Server latency - It’s the time taken by the server to process and generate a response.
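
As a minimal sketch of how server latency and server throughput relate, the snippet below times a stand-in request handler. `measure` and `fake_handler` are hypothetical names, and the 2 ms sleep simply simulates processing work; network latency is not included.

```python
import time


def measure(handler, requests):
    """Return (average latency in ms, throughput in requests per second).

    `requests` must contain at least one item.
    """
    latencies_ms = []
    started = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        handler(request)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - started
    average_latency_ms = sum(latencies_ms) / len(latencies_ms)
    throughput_rps = len(latencies_ms) / elapsed
    return average_latency_ms, throughput_rps


def fake_handler(request):
    time.sleep(0.002)  # stand-in for roughly 2 ms of server-side work
    return request.upper()


avg_ms, rps = measure(fake_handler, ["a", "b", "c", "d", "e"])
print(f"average latency: {avg_ms:.2f} ms, throughput: {rps:.0f} requests/s")
```

Because this handler processes one request at a time, throughput is roughly the inverse of latency; a server that handles requests concurrently can raise throughput without lowering the latency of each individual request.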

Why are latency and throughput important?

If the latency is high, there is a long delay before responses arrive. If the throughput is low, the number of requests processed per unit of time is low. High latency and low throughput both impair the performance of a system. There are systems, such as games, where latency matters a lot: if the latency is high, a user will experience lag, which drastically impairs the user experience. When making database queries, one can improve server latency and throughput by caching results in memory. The following is an example of latency tests.

Latency Tests

Latency tests carried out across key data storage media such as in-memory cache, SSD and HDD, as well as network calls, reveal the following:

i) Reading 1MB sequentially from cache memory takes 250 microseconds.

ii) Reading 1MB sequentially from an SSD takes 1,000 microseconds or 1 millisecond.

iii) Reading 1MB sequentially from disk (HDDs) takes 20,000 microseconds or 20 milliseconds.

iv) Sending a packet of data from California to the Netherlands and back to California over a network takes 150,000 microseconds or 150 milliseconds.

1000 nanoseconds = 1 microsecond

1000 microseconds = 1 millisecond

1000 milliseconds = 1 second

Therefore reading from in-memory cache is 80 times faster than reading from an HDD!
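
The comparison can be checked with a few lines of arithmetic. The dictionary below simply restates the figures listed above; the names `LATENCY_US` and `times_slower` are illustrative.

```python
# Latencies from the list above, in microseconds.
LATENCY_US = {
    "memory": 250,
    "ssd": 1_000,
    "hdd": 20_000,
    "network round trip": 150_000,
}


def times_slower(medium, baseline="memory"):
    """How many times slower `medium` is compared with `baseline`."""
    return LATENCY_US[medium] / LATENCY_US[baseline]


for medium, microseconds in LATENCY_US.items():
    milliseconds = microseconds / 1000
    ratio = times_slower(medium)
    print(f"{medium}: {milliseconds} ms, {ratio:g}x the memory read time")
```

By the same arithmetic, an SSD read is 4 times slower than a memory read, and the intercontinental round trip is 600 times slower.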

Summary on Fundamentals of System Design Part I

In this article we talked about the metrics that affect the performance of a system. These are key characteristics of a system that every developer should understand. Watch out for part II where we will talk about the actual components that you can implement in a system such as load balancers, proxies and caches.

Thank You!

I’d love to keep in touch! Feel free to follow me on Twitter at @codewithfed. If you enjoyed this article please consider sponsoring me for more. Your feedback is welcome anytime. Thanks again!

