Post

Fundamentals of System Design Chapter 8: Understanding Database Indexing

Introduction

Imagine you have a large library with thousands of books. Let’s say a client walks in and requests a book by it’s title. Without any organization, you would have to search all the books until you find the one you are looking for. This process is inefficient and time consuming.

However, if the library has a well designed indexing system it gets very easy to locate any book. The indexing system will contain a catalogue that lists all the books in an alphabetical order along with their corresponding shelf numbers. By consulting the index, you can easily pinpoint the exact location of a book request saving you time and effort.

What is database Indexing?

Database Indices works in a similar way. In the analogy above, the library represents a database, the books represents the data stored and the library index represents a database index. A database index is a powerful tool that enables you to quickly retrieve records from a database file. An index is a data structure that improves the speed of data retrieval process. An index in a database is also very similar to a book index.

Computer Example

Let’s say you have a large database that has a user’s table which contains information for about 1 million users.

IDFirstNameLastNameEmailCountry
1JamesOmollojames@gmail.comKenya
2GbemisolaOluwasegunOluwasegun@gmail.comNigeria
3TrishnaPranavPranav@email.comIndia
4EstherColemanColeman@email.comUK
5ZaraShahidiShahidi@email.comUSA

Assuming Zara is the last record on our table, when you query the database for her details it will take a long time since you will have to traverse through all the 1 million records(full table scan). If the data was ordered alphabetically using the first name, searching could happen faster since we could jump down to halfway through the data to see if Zara comes before or after that row. We could then half the remaining rows and make the same comparison by use of binary search. Indices helps us create a sorted lists without really creating sorted tables which would take a lot of space. Let’s create an index and see how it maps back to the user table using “FirstName” column. In this example we will use Postgres Database.

1
CREATE INDEX users_firstName ON users(FirstName);
FirstNameID
Esther4
Gbemisola2
James1
Trishna3
Zara5

The Index has user’s first names sorted in alphabetical order. When you perform the SQL Query below to get a user’s record using FirstName as the filter, the database can swiftly locate the relevant rows, rather than scanning the entire table sequentially. The key will be used to quickly find the rows that match a specific FirstName. This indexing technique significantly improves the query performance, especially when dealing with larger datasets.

1
SELECT FirstName From Users WHERE FirstName = 'Zara';

Remember, indexing is not limited to a single column. You can create Indices on multiple columns or even compound Indices that span across multiple columns, depending on your specific use cases and query patterns.

Index architectures

1. Clustered Index

Popular SQL database like SQL server automatically create a clustered index if a constraint or primary key is defined on a table. Each table can only have one clustered index based on the primary key. The PK defines the order in which the data is physically stored on disk. Use the SQL script below to create users table and insert data in an SQL Server Database.

1
2
3
4
5
6
7
    CREATE TABLE users (
        ID INTEGER PRIMARY KEY NOT NULL,
        FirstName VARCHAR(50) NOT NULL,
        LastName VARCHAR(50) NOT NULL,
        Email VARCHAR(100) NOT NULL,
        Country VARCHAR(50) NOT NULL
    );
1
2
3
4
5
6
    INSERT INTO users (ID, FirstName, LastName, Email, Country) VALUES
           (2, 'Gbemisola', 'Oluwasegun', 'Oluwasegun@gmail.com', 'Nigeria'),
           (3, 'Trishna', 'Pranav', 'Pranav@email.com', 'India'),
     (1, 'James', 'Omollo', 'james@gmail.com', 'Kenya'),
           (5, 'Zara', 'Shahidi', 'Shahidi@email.com', 'USA'),
           (4, 'Esther', 'Coleman', 'Coleman@email.com', 'UK')

When you query the database:

1
SELECT * FROM users

We get user’s data sorted ascending by the ID(PK). Remember we did not insert the data ascending by the PK.

The database automatically creates a clustered index. Run the following command to confirm.

1
 Execute sp_helpindex users;

A clustered index determines the physical order of the data rows in a table. However inserts and updates can be more expensive compared to non-clustered Indices, as they may require data to be physically rearranged.

2. Non-Clustered Index

Our library analogy is a good example of a non-clustered index.The data is stored in one place, the index in another place. The index contains pointers to where the data is stored. Since a non-clustred stores data separate from actual data, the table can contain more than one clustured index. Just like how a book can have an index by chapters at the beginning and another one index at the end. Inserts and updates on non-clustered Indices are generally faster than on clustered Indices, as they don’t require rearranging the data. Non-clustered Indices are commonly used to optimize query performance for specific columns that are frequently accessed.

Types of Database Indices Implementations

Databases Indices make use of data structures and algorithms to access indexed data efficiently. The most popular ones are:

1. Binary-Tree(B-Tree)

This is the most popular database index data structure.It organizes data in a self-balancing tree, typically a binary search tree. It provides efficient searching, insertion and deletion operations by keeping the tree balanced and minimizing the number of disk access required.

2. Hash

Hash indexes use a hash function to map keys to specific locations in the index. The hash function distributes the data uniformly across the index, allowing for fast lookup operations. Hashing architectures are efficient for equality searches, but they may suffer from collisions, where multiple keys hash to the same location, requiring additional handling.

Benefits of database Indices

  1. Improves query performance for Frequently accessed columns: If certain columns are frequently used in queries, using searching, filtering or join operations, it’s beneficial to index those columns to retrieve data quickly and improve query performance.

  2. Faster Data retrieval for Large Tables: Indexing is very useful for large tables that contain a significant amount of records. Without using Indices, querying large tables can be very slow and resource intensive.

  3. Improved Scalability: Indexing improves scalability in a database system. As the amount of data grows, Indices provides a way to efficiently access and retrieve data ensuring that query performance remain consistent even with large datasets.

Drawbacks of Database Indices

While database Indexing offers numerous benefits, it has some of the following drawbacks:

  1. Overhead during Data Modification: When performing data modifications such as create, update or delete in a table that has an index, the Indices need to be updated to reflect on the data changes. These operations incur some overhead. The more Indices a database has, the longer it takes to perform data modifications, potentially affecting write operations.
  2. Increased Storage Space: Indices requires additional storage space to store the index data structures. Indices can consume a significant amount of disk space and the space requirement increases proportionally with the number of indexed columns and the size of indexed data.

Conclusion

Database indexing is an important technique you can add to your toolbox to improve query performance and data retrieval in databases. When you create Indices on frequently accessed columns, the database can quickly map and retrieve data based on the specified conditions resulting in faster execution times. Indexing is particularly beneficial in large tables, complex queries involving multiple table joins and frequently accessed tables. It’s important to carefully consider columns and queries that would benefit most from indexing. Creating Indices for every column can result to unnecessary overhead and additional costs.

Thank You!

I’d love to keep in touch! Feel free to follow me on Twitter at @codewithfed. If you enjoyed this article please consider sponsoring this blog for more. Your feedback is welcome anytime. Thanks again!

This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.