Fundamentals of system design Chapter 5: Databases
The components of system design
When designing any system you will realise that it will require to store some data. That is where a databases comes in. Choosing a database for a job can either make-or-break your system.
Introduction
A database is a collection of data stored in a specific way so at to allow efficient storage and retrieval of data.
A Database Management System(DBMS) is a software that interacts with a database providing an interface for creating, modifying and querying a database. Examples of DBMS include SQL Server, MySQL, PostgreSQL, Oracle and MongoDB.
Types of Database Management System(DBMS)
There are hundreds of databases which have different properties and are optimized for special functionalities. It’s difficult to group the databases, however, when we look at the structure imposed on the data we can come up with two main groups: relational and non-relational databases.
Relational Databases
Relational databases are also known as Relational Database Management System(RDBMS) or SQL databases. They organize and store data in tabular format of rows and columns, making it easy to understand how data relates to one another. The most important ones are MySQL, SQL Server, PostgreSQL and Oracle.
Benefits of Relational Databases
i. Query capabilities - It provides a powerful query language using SQL to allow easy data retrieval and data manipulation.
ii. Simplicity - The main benefit of relational databases is provides an intuitive way to store data and allows an easy access to related data points. It’s free from complex structuring. This is why it’s commonly used by organizations which handle large amounts of structured data, from tracking products to handling transactions.
iii. Normalization - Relational databases employs normalization to reduce data redundancy.
iv. Data Security - It provides multiple levels of security, including user-level to system-level security which protects data stored in database.
v. ACID Compliance - Relational databases supports ACID compliance. ACID is a concept that describe four properties of a transaction of a database system which ensure accuracy and integrity of data. A transaction is a group of operations that are executed as a single unit of work. An example of a transaction is when transferring money between bank accounts. Money must be debited from one account and credited to another. The key property of the transaction are:
Atomicity - When performing a transaction, all the operations are performed or none of them are. Atomicity ensures that if debit is made successfully from one account, the corresponding credit is made in the other account.
Consistency - Consistency ensures that the transactions maintain data integrity, leaving the data consistent and correct. Data constraints must be followed. For example we can have a constraint in the amount column that the data cannot be negative. If the transaction leaves data in invalid state, the transaction will be aborted and an error is reported.
Isolation - Isolation means that the intermediate state of transaction is invisible to other transactions until a commit is done(concurrent control). For example Account A has $1,000 and there are two transactions happening at the same time. Transaction A wants to transfer $1000 to another account and transaction B wants to transact $200 to another account. If these two transaction were allowed, they would leave the account in an invalid state(-$200). To prevent this, a database should only allow only one transaction on an account at a time. The transactions should be done sequentially and put in some sort of a queue.
Durability - Durability ensures that when a transaction is completes successfully, changes will persist in non-volatile memory even in the event of system failure.
Limitations of Relational Databases
i. Expensive - The main drawback of relational databases is increased costs. Relational databases are costly to setup and maintain. It requires performance hardware and a lot of storage which is costly. The initial cost of software can be pricey. It gets worse when you factor in the cost hiring skilled personnel to manage the relational databases.
ii. Scalability - Relational databases may struggle to handle very large amounts of data or a high volume of concurrent transactions. Relational databases are typically designed to run on a single server, you can only scale vertically. Vertical scaling is limited to processors and memory upgrades. To scale relational database effectively it has to be distributed on to multiple servers. Handling tables across different servers is difficult.
iii. Flexibility - Relational databases are designed to store structured data, making it difficult to handle unstructured or semi-structured data such as text or images. They require a schema which makes development difficult if you don’t know the data-shape upfront.
iv. Performance - Relational databases may have performance issues when executing complex queries or when dealing with a high number of joins.
v. Limited Indexing - Relational databases typically have a limited indexing capability, which can lead to poor query performance for large datasets.
Non-Relational Databases
Non-relational databases are also known as non-relational database management system or NoSQL(not only SQL). They are non-tabular. There are four major types of NoSQL databases: document, key-value, wide columns and graph databases.
Document Databases - Stores data in a document. A document is a container for key-value pairs. The document is stored in JSON or XML format. The are unstructured and don’t require a schema. The documents are grouped together into collections. The collections can be organized into hierarchy allowing you to model and retrieve relational data to a significant degree without joins. They are mainly suited for Online games, IOT, mobile apps etc. The most popular ones are Firestore, DynamoDB, CouchDB and MongoDB.
Key-Value databases - Store data in a format where each item contains keys and values. They store data in a hash table where each key is unique just like in dictionaries and the value can be JSON, string, BLOB(Binary Large Object) etc. Key-Value data store mainly store data in memory and not the disk. This limits the data you can store but makes it extremely fast.They are mainly used for data caching.A good example is Redis or memcached.
Wide-Column database - A wide-column database, also known as a column-family database, is a type of NoSQL database that organizes data into rows and dynamic columns. It just like key-value databases but it adds columns in the values. Individual rows then constitute a column family. This allows for more flexible and efficient storage of large amounts of unstructured data, as well as faster querying and retrieval of specific data subsets. Column Dbs are ideal when your app is write heavy with little updates. A popular use case is for storing time-series data such as records for an IOT device. Examples of wide-column databases include Apache Cassandra, Hbase and Amazon SimpleDB.
- Graph Databases - Stores data in nodes and edges. The nodes typically store information about people, places and things while edges store information about the relationship between the nodes. They are a good alternative to relational databases especially when you require to make a lot of joins and performance is taking a hit. Graphs are useful when identifying patterns between structured and semi-structured data. They are mainly used in fraud detection in finance and to power recommendation engines like in AirBnB. Popular ones are neo4j and Dgraph.
Benefits of NoSQL database
i. Flexibility - Relational databases strictly store data in structured format. NoSQL provides support to flexible schemas that enables faster development. The flexible data model is perfect for unstructured and semi-structured data. NoSQL databases allows developers to change the structure of data without disrupting the current structure.
ii. Scalability - Relational databases are mostly implemented using scale-up architecture which means you increase computer performance by upgrading CPU’s and adding more memory. Non-relational databases on the other hand, were designed to scale out by using distributed clusters of hardware. The scale-out architecture makes NoSQL easily scalable when data and traffic grows.
iii. Simplicity and Developer-friendly - Many NoSQL databases are simple and developers find it easier to create various types of applications compared to NoSQL.In addition, NoSQL store data in forms that are close to data objects therefore fewer transformations are required when moving data in and out of the databases.
iv. Large amounts of data storage - Many NoSQL databases can handle large data sets making them perfect for big data applications. NoSQL is great when you are dealing with large volumes of data.
Limitations of NoSQL database
i. The lack of SQL - A major complaint about NoSQL is that it does not come with SQL which a mature technology for querying databases. They lack the complex tool functionality found in SQL.
ii. Lack of Standardization - Every NoSQL databases uses it own schema. Such databases like MongoDB, there is no schema. It is dynamic in some other databases like ElasticSearch. Some other database designs resemble relational databases(for example cassandra). The problem here is that each database has it’s own unique strengths and weaknesses which must be learned before choosing the right database for a project.
iii. The lack of ACID - We discussed ACID as the properties that define a transaction. NoSQL does not support these properties.Transactions provide the strongest guarantees available around data consistency.
Final
Always keep in mind the following factors when choosing a database for a new application:
i. Data model - Consider the structure of your data and how it will be stored and retrieved. If your data is highly structured and requires complex relationships, a relational database like MySQL or PostgreSQL may be a better fit. If your data is unstructured or semi-structured, a NoSQL database like MongoDB or Cassandra may be more appropriate.
ii. Scale - How much data do you expect to store and how quickly do you need to retrieve it? If you expect to store a large amount of data and need to scale horizontally, a NoSQL database may be better suited for your needs.
iii. Performance - Consider the performance requirements of your application. Some databases are better suited for high-write workloads, while others are optimized for high-read workloads.
iv. Availability and durability - Consider how important it is for your application to be available and how you will ensure data durability. Some databases offer built-in replication and fault tolerance, while others require additional setup and configuration.
v. Transactions - The ACID properties in relational databases gives you a guarantee around data atomicity, consistency, isolation and durability when dealing with transaction operations.
vi. Cost - Consider the total cost of ownership, including licensing, hardware, and maintenance costs.
It’s important to note that there is no one-size-fits-all solution and the best database for your application will depend on your specific requirements. When engineering at scale is all about trade offs. Effortless horizontal scalability in NoSQL databases comes with a hidden cost. It’s important to dive deep to find out where that cost is hiding.
Thank You!
I’d love to keep in touch! Feel free to follow me on Twitter at @codewithfed. If you enjoyed this article please consider sponsoring this blog for more. Your feedback is welcome anytime. Thanks again!
Comments powered by Disqus.