book page

Overview

Databases

Introduction

There comes a point in every budding developer’s life when they need to use a database for the first time. The function of a database – reliably storing and retrieving data – is so fundamental that it’s hard to do much programming without encountering one. Yet databases sometimes seem a little unfamiliar, a little other. The database is separate from your code and interacting with it often requires using a different language, SQL. It’s tempting to learn the bare minimum and stick to what you’re comfortable with. Databases also have a reputation for being a little boring. Perhaps it’s the old-fashioned syntax of SQL or all those tutorials about storing employee records.

In this chapter, I hope to convince you that databases are in fact deeply interesting and well worth engaging with properly. They’re like a microcosm of computer science. Parsing a SQL query and generating an execution plan touches on programming language design and compilers. Databases rely on clever data structures to enable fast access to huge amounts of data. Sophisticated algorithms ensure correct concurrent behaviour and that data is never lost. The database is one of those applications that acts as a linchpin, drawing together everything that we’ve studied so far.

Databases come in a huge variety of shapes and sizes. I’m going to focus solely on relational databases, which are by far the most popular. We’ll start with the theoretical underpinnings of relational databases. From there, we’ll look at the four important guarantees that databases provide, which go by the funky acronym of ACID. The rest of the chapter will explore a typical database architecture. Due to space constraints, I’m going to assume that you have worked with a database before and can do basic SQL queries. Maybe you’ve even designed a schema for a project. But you don’t yet know why or how it all works. If this is all new to you, worry not. Introductory resources are referenced in the further reading. We won’t cover any advanced SQL tricks as I believe that they are easier to learn by solving real life problems as you encounter them.

Throughout this chapter, I’ll make reference to two databases, Postgres and SQLite, depending on which seems like the better example for a given point. Both are relational databases and so are conceptually very similar but they have very different implementations. Postgres is a high performance database offering oodles of cutting edge features. It runs as a server that clients must connect to in order to perform queries and operations. SQLite, as the name suggests, is much more lightweight and stores the entire database in a single file. Its small footprint makes it a popular choice for embedding a database in other applications. Google Chrome uses SQLite to store internal data such as browsing history.

Finally, a point on terminology. The term “database” is overloaded. Technically, “database” refers to the collection of entities and their relations that make up the application’s data. Programs like Postgres and SQLite are examples of relational database management systems (RDBMS). In this chapter I’ll use “database system”, “database engine” or just “system” to refer to the program that manages the database.

What does a database offer?

Working with a database entails understanding a whole other programming paradigm, learning a new language and figuring out how to integrate the system into an existing application. Why go to all this bother when we can just store application data in a plain file?

Databases offer numerous benefits over a plain file. They provide assurances that your data is protected from loss and corruption. If you use plain files, you’re responsible for writing all that code yourself: opening and parsing the file, making bug-free modifications, gracefully handling failures, catching all of the edge cases where data can get corrupted and so on and so on. Doing all this correctly is a huge task. Just as you probably don’t want to be writing your own OS, you probably don’t want to be writing your own data management layer. On top of protecting you from disaster, databases also offer a bunch of neat features. They allow concurrent access by multiple users, improving your application’s performance. They offer a sophisticated query interface in the form of SQL, meaning you avoid having to hand-code every query. The database system will also run those queries much faster than anything you could write yourself.

In short, database systems abstract away the gory details of data storage and management. They provide a query interface that, while sometimes a little clunky, is immensely powerful and creates a clear boundary between your data and your business logic. They provide important assurances so that you don’t have to worry about data loss or corruption. Database systems have four properties that guarantee that the data they contain remains safe even in case of failures: atomicity, consistency, isolation and durability. Together they go by the acronym ACID. Let’s look at each in turn:

Atomicity

Atoms were so called because they were believed to be indivisible. Of course, we now know that they’re not indivisible after all – awkward! In computing terms, an operation or sequence of operations is atomic if it is either performed to completion or not performed at all. No halfway state is allowed, even if there’s some kind of catastrophic power failure mid-way through. This is very important from the point of view of data safety because we never want to get into a situation where an update to the data is only partially performed.

The canonical example in the textbooks is transferring money between two accounts. Peter wants to pay Paul £100. This involves two steps. First, credit Paul’s account with £100:

UPDATE accounts
SET balance = balance + 100
WHERE account_name = 'Paul';

Then, debit Peter’s account by £100:

UPDATE accounts
SET balance = balance - 100
WHERE account_name = 'Peter';

Note that it doesn’t matter which statement is executed first as long as they both execute. However, if an error were to occur after crediting Paul but before debiting Peter (perhaps the database system crashes or Peter doesn’t have enough money in his account) then the system erroneously “creates” money by giving Paul money without taking it from Peter. A fortunate outcome in this case, but if the statements were executed in the other order money would disappear from the database!

Of course, the system can’t perform its operations instantaneously so there will inevitably be times when it is in the middle of performing some kind of modification to the database. How do we prevent a failure at this point from messing up the entire database? We introduce a concept known as a transaction:

BEGIN;

UPDATE accounts
SET balance = balance + 100
WHERE account_name = 'Paul';

UPDATE accounts
SET balance = balance - 100
WHERE account_name = 'Peter';

COMMIT;

The transaction is created by wrapping the steps in BEGIN and COMMIT statements. A transaction consists of a sequence of operations that must all succeed or all fail. We’ll see later how a database system ensures that this happens, but at a high level the system simply doesn’t consider a transaction to be completed, or committed, until all of its operations have been successfully committed. If one fails, the system performs a rollback, undoing all of the changes to return the database to its state before the transaction began.
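An application can also abandon a transaction itself. Here is a minimal sketch of the failure case, reusing the same accounts table:

BEGIN;

UPDATE accounts
SET balance = balance + 100
WHERE account_name = 'Paul';

-- Suppose we now discover that Peter has insufficient funds.
-- ROLLBACK abandons the transaction; the credit to Paul never becomes visible.
ROLLBACK;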

Consistency

The consistency property ensures that the database always makes sense from the application’s point of view. A table schema defines the table’s attributes and their types. For example, a user may have a “salary” attribute, which must be a number. Constraints further restrict the possible values that an attribute can have. The “salary” attribute might be constrained to never contain a negative value because a salary can never be less than zero. If we gather all of the table schemas from a database, we have a set of expectations about the data. We have data safety when the data is consistent with these expectations. Some expectations may span multiple entities. Referential integrity requires that a foreign key in one table refers to a valid primary key in another.

Very common constraints are to state that a column can’t contain a NULL value or that every value must be unique:

CREATE TABLE products (
    product_no integer,
    name text NOT NULL,
    price numeric,
    UNIQUE (product_no)
);

In this example table, every product must have a name and its product_no must be unique. More sophisticated, custom constraints are also possible:

CREATE TABLE products (
    product_no integer,
    name text NOT NULL,
    price numeric CHECK (price > 0),
    UNIQUE (product_no)
);

Now the database will abort, or completely undo, any attempt to create or update a product with a negative price. This fits with our basic understanding of how prices work. It might sound obviously ridiculous to try to give a product a negative price, but it could happen quite easily if you’re applying a series of discounts to a product and make a mistake in the calculations. Constraints don’t mean that you don’t need to do any validation or testing in your application code. You should still try to detect invalid data as soon as it’s generated and prevent it ever getting near the database. Constraints are more like a last line of defence.
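To make that concrete, an insert like the following sketch would be rejected outright; the exact error message varies between systems:

-- Violates CHECK (price > 0), so the whole statement is aborted
-- and no row is created.
INSERT INTO products (product_no, name, price)
VALUES (42, 'Mystery box', -5);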

How much complex business logic you should encode in the database is a matter of debate. If there are multiple applications connecting to the same database – a web app, a mobile app etc – then it makes sense to write the logic once in the database rather than re-implementing it (and possibly introducing bugs) in each new client app. Sceptics retort that you’ll end up with business logic spread through constraints and each client app, creating an unholy mess.

Isolation

Sometimes the database needs to move through intermediate, inconsistent states before it arrives at a new, consistent state. We rely on transactions to make this happen. Consistency guarantees that a transaction can always see the results of all previous, committed transactions. Isolation ensures that a transaction’s intermediate states cannot be observed from outside of the transaction. As long as the transaction ends in a consistent state, the rest of the database will only see a change from one consistent state to another.

Without isolation we could have situations like the following. Transaction A begins. Halfway through, when the database is in an inconsistent state, transaction B starts using the inconsistent state as its starting point. Even though transaction A eventually reaches a consistent state, the inconsistency leaks out to transaction B. Isolation ensures that the effects of one transaction do not become visible to another until the first has completed.
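Here is a minimal sketch of isolation in action, reusing the accounts table and assuming the default isolation level of a system like Postgres:

-- Session A:
BEGIN;
UPDATE accounts SET balance = balance + 100 WHERE account_name = 'Paul';

-- Session B, running concurrently, still sees Paul's old balance
-- because session A has not yet committed:
SELECT balance FROM accounts WHERE account_name = 'Paul';

-- Session A:
COMMIT;

-- Only now will a fresh query in session B observe the credited balance.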

With proper concurrency controls, which we’ll look at soon, multiple transactions can run concurrently and leave the database in the same state as if they’d run sequentially.

Durability

Finally, durability ensures that a change committed to the database is never lost, even in case of power failure. Of course, the database is limited by the hardware it’s running on. It can’t do much if the hard disk gets smashed to pieces!

The database system has to be sure that changes are written to persistent storage. It must play an interesting game with the operating system. As mentioned in the OS chapter, the OS will try to improve I/O performance by buffering writes to persistent storage. The sneaky OS might tell the database engine that it’s written the data to disk when really all it’s done is write the data to an in-memory buffer. If the OS then crashes, the data will be lost. It’s a bit like if your boss had asked you to complete a task and you’d claimed that it was done when really you’d only asked a colleague to do it. Hopefully your colleague will do it correctly and everything works out fine, but maybe they’ll get hit by a bus and suddenly your boss is wondering why that thing you said was done hadn’t even been started.

The database has to be very careful to ensure that it logs important events to persistent storage so that they can be recovered in the event of system failure.

Relational algebra and SQL

Before we delve into the gnarly details of how databases work, it is important to understand a little bit of the theory that underpins relational databases.

Relational databases are so called because they rely on relational algebra. In common parlance, algebra is a set of operations – addition, subtraction and so on – that we can perform on numbers. A very important insight is that there are many algebras. The one we learn in school is just one instance of a more general concept. To make an algebra, all you need is a set of elements and a set of operations that can be performed on those elements. Relational algebra defines operations that can be performed on relations. It was developed as a model for databases by Edgar Codd in 1970 (see further reading). The idea proved wildly successful and relational databases are by far the most dominant database type.

A relation is a set of tuples matching a schema. The schema is a set of attributes, which are name and type pairs. A tuple is an ordered collection of elements corresponding to each attribute. The schema defines what entity the relation describes and the tuples are instances of the entity. It’s an unwritten rule that every book on databases has to include an example modelling employees. This book is no different. Here is our example employee schema:

CREATE TABLE Employee(
  employee_id INTEGER PRIMARY KEY,
  name text,
  salary integer,
  department text NOT NULL);

SQLite, which I’m using for these examples, automatically assigns an incrementing value to employee_id because an INTEGER PRIMARY KEY column acts as an alias for the row ID. For simplicity I’m treating the department as just a string. Here are the tuples in our example relation:

employee_id  name         salary      department
-----------  -----------  ----------  ----------
1            Amy Johnson  40000       Tech
2            Bob Wilson   30000       HR
3            Hafsa Hasad  35000       Sales
4            Anna Jones   25000       Marketing
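For reference, rows like these could have been inserted with statements along these lines; SQLite assigns the employee_id values for us:

INSERT INTO Employee (name, salary, department)
VALUES ('Amy Johnson', 40000, 'Tech');

INSERT INTO Employee (name, salary, department)
VALUES ('Bob Wilson', 30000, 'HR');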

(2, "Bob Wilson", 30000, "HR") is a tuple consisting of four values which together represent an individual employee. Note that the ordering of the tuple attributes is significant. The reason we know “Bob Wilson” refers to his name and not his department is due to the values’ positions in the tuple.

In database systems, relations are implemented as tables. Each tuple in the table is known as a row. The schema attributes are columns in the table. Between rows there is no ordering. The database has no particular reason to put one employee above another. It’s sorted by employee_id here but any other attribute is just as valid.

Tuples are uniquely identified by their values. This presents a problem for database systems. What if there are two John Smiths working in Accounts and both earning £45,000? We need a way to disambiguate them. The solution is to create a primary key: a column, or combination of columns, that uniquely identifies a tuple. Sometimes the tuples contain an attribute that naturally works as a unique identifier. The employee_id plays that role here. If necessary, an additional, autoincrementing attribute can be added to the schema. With a primary key, we can disambiguate tuples that have otherwise identical attributes.

Even better, we can express the idea of a relationship by allowing tuples in one relation to hold the primary key of tuples from a different relation. When a relation has an attribute holding the primary key of another relation, we refer to that attribute as a foreign key. A more realistic Employee relation would have a separate Department relation and store the employee’s department as a foreign key, as sketched below. This is how we can express relationships between relations. It’s simple but very powerful.
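Here is a sketch of that more realistic design; the Department relation and the department_id column are invented for illustration:

CREATE TABLE Department(
  department_id INTEGER PRIMARY KEY,
  name text NOT NULL);

CREATE TABLE Employee(
  employee_id INTEGER PRIMARY KEY,
  name text,
  salary integer,
  -- Foreign key: each employee refers to the primary key of a department.
  department_id integer NOT NULL REFERENCES Department);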

SQL (Structured Query Language) is a programming language designed to express relational algebraic queries. It’s a practical implementation of the mathematical ideas behind relational algebra. As we explore the operators of relational algebra, I’ll show how they correspond to SQL expressions.

The two most important relational algebraic operations are selection and projection. Selection filters the relation’s tuples based on whether they meet some criterion. In SQL, it’s the WHERE clause that we all know and love:

WHERE Employee.salary > 32000

Once we have the desired tuples, we apply a projection to choose elements from the tuples. I like to think of it as shining a light from behind the tuple to project some of its values on to a wall or screen:

SELECT name

Remember: selection chooses the tuples from the relation and projection chooses the values from the tuples. Altogether our query is:

SELECT name
FROM Employee
WHERE Employee.salary > 32000;

The output shows who the highly-paid employees are:

name
-----------
Amy Johnson
Hafsa Hasad

When working with relational databases, an important “woah, it’s all connected, man” insight is that each relational algebraic operation takes relations as input and generates new relations as output. Those output relations then form the input for other operations, generating yet more relations. Look at how the above query might generate intermediate relations:

Operators take relations and output relations

The intermediate relations only exist in the abstract. The database system might not actually physically generate them if it can find a more efficient way to generate the correct result. In the example above, the system might determine that it can do everything in a single pass by extracting just the name whenever it finds a highly paid employee. This would be much more efficient than first creating a temporary table of highly paid employees and then iterating through that to get the names.

Such considerations are mere implementation details. When writing a query, it’s best to think of operations on relations. It’s much easier to express complex things in SQL when you think in terms of relational operations. It’s very common, indeed good practice, to see queries forming a sub-query in a larger query. For example, the inner query below generates a relation containing the average employee salary and then feeds that to an outer query that finds every employee paid less than the average:

SELECT name, salary
FROM Employee
WHERE salary < (
  SELECT AVG(salary)
  FROM Employee
);

Another very useful operation is to take one relation and join it to another. A natural join takes two relations and tries to join the tuples together on matching attributes. Joining on a foreign key works in this way, since the foreign key of one tuple should match the primary key of a tuple in the other relation. A variation on this idea is to only join those tuples that match some predicate. Joining creates a new tuple with the attributes from both relations. It’s the relational algebra equivalent of a logical AND operation (think back to the circuits of the architecture chapter). The natural join in SQL is known as an inner join:

Combining rows with joins

Here we see that the only Department tuple matching Amy Johnson is the one with the corresponding department_id, which is what we would expect. Sometimes, though, you want to show all tuples, whether they have a match or not. An outer join shows every tuple from one relation, with values from a matching tuple if one exists in the other relation, or empty values if not.

An outer join

If you want to get a list of all customers and any purchases they’ve made, then you need a left outer join so that customers without any purchases are included. This join takes every tuple from the relation on the left of the join (Customer, in this case) and creates a new tuple pairing the customer with each purchase they made, or with empty values if they made none. In the example above, you can see that Clare Saves has been true to her name and made no purchases. She still appears in the results. If we’d used an inner join here, Clare would not appear as there are no matching tuples in the Purchases relation.
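In SQL, that query might look something like the following sketch, assuming hypothetical customers and purchases tables:

SELECT customers.name, purchases.item, purchases.amount
FROM customers
LEFT JOIN purchases ON purchases.customer_id = customers.id;
-- Customers with no purchases still appear, with empty values
-- in the purchase columns.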

This entails the existence of some kind of empty value that expresses the absence of any value at all. In SQL this is called NULL. SQL uses three-valued logic: true, false and NULL. If you come from the world of JavaScript, you might think that NULL implies some kind of falseness. It does not. False is a value. NULL is the absence of a value.

The power of relational algebra comes from the fact that it offers a set of operations that are powerful enough to allow the application developer to work their magic but limited enough to allow database systems to create very efficient implementations. A downside is that you need to be able to express your application’s data in terms of relations. When starting a new project or adding new functionality to an existing one, it is very important to sit down and design the required schema: what tables do I need, how do they relate to each other, what constraints should I apply?

One of the most interesting things about SQL is that it is a declarative programming language. When writing a SQL query, we tell the database engine what results we want in terms of relational algebraic operations (using FROM, WHERE, SELECT and so on). We give absolutely no instruction to the database system about how it should actually go about generating the results. This is very different to standard programming, where we normally tell the computer exactly what to do. As we’ll see later, the database system is responsible for taking the query and finding an efficient way to generate the results.

SQL’s declarative nature is one of the things that cause people problems. It’s a very different way of thinking about programming. Another snag is the structure of SQL queries themselves. True to its declarative nature, SQL was designed to read like an English instruction: select these attributes from these tuples where the tuple matches these criteria. Unfortunately, the logical order in which the database engine analyses the query is different to the lexical order in which it is written. This leads to confusing errors where the system complains that it doesn’t understand a reference you’re using even though the reference is right there, goddammit.

We will shortly follow a sample SQL query through the innards of a database system. Before beginning, let’s understand the logical ordering of the query and how it might be expressed in relational algebra.

Executing a SQL query

This flowchart describes the logical order of SQL. It’s very important to remember that the logical ordering is different to both the lexical ordering and the order in which the database system actually executes the query. The system is free to reorder the query into a more efficient form as long as it returns the correct results.

Our example schema models a ticket booking system. Users buy tickets for events. A user can have many tickets but a ticket has only one purchaser. This is a one-to-many relationship. The event is represented by the event_name attribute in the ticket. A more realistic schema would have a separate Events relation but I want to avoid cluttering the example with too many joins.

CREATE TABLE Users(
  id integer primary key,
  first_name text,
  surname text
);

CREATE TABLE Tickets(
  id integer primary key,
  price integer,
  purchaser_id integer references Users,
  event_name text
);

Our sample query will return every customer and the total they’ve paid for each event they have tickets for:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price) AS total_paid
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
GROUP BY users.first_name, users.surname, tickets.event_name;

The written SQL query begins with SELECT. From what we know of relational algebra, this is a bit odd. A SELECT is a projection, meaning that it extracts some subset of attributes from the tuples. We need to have tuples before we can do a projection! Logically, the system actually starts by retrieving the tuples from the relation specified in the FROM clause. These are the raw input that will be fed into the query pipeline. Joins are also processed as part of this step. This means that the joined tuples will contain all attributes from the joined relations.

After this, the WHERE clause applies a selection operation to filter out unwanted rows. In the WHERE clause, we can only reference things that are specified in the FROM clause. For example, I might try and amend our query to only show customers who have spent more than £50 on an event:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price) AS total_paid
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
WHERE total_paid > 50
GROUP BY users.first_name, users.surname, tickets.event_name;

In SQLite, this fails with the cryptic error:

Error: misuse of aggregate: sum()

The problem is that total_paid is defined in the SELECT clause, which hasn’t been processed yet, so the name doesn’t exist at the point the WHERE clause runs. You might think we can get around this by using the aggregate function SUM directly in the WHERE clause. This won’t work either (see the sketch below) because at this point in the query execution we are still selecting which tuples we want. An aggregate function generates a value from multiple input tuples. It stands to reason that we can only perform an aggregation operation once we have a set of tuples to operate on. What might seem like a strange error makes perfect sense when you understand the logical ordering of the query.
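Here is that doomed variant for completeness; the exact error differs by system, but every relational database will reject it:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price) AS total_paid
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
-- Rejected: aggregates cannot appear in WHERE, which runs before grouping.
WHERE SUM(tickets.price) > 50
GROUP BY users.first_name, users.surname, tickets.event_name;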

Next comes the optional GROUP BY. It specifies how the selected tuples should be grouped together. It’s only required because we are using SUM. Here we want to calculate the sum per user per event. If you’re not sure which columns to group by, bear in mind that only columns mentioned in the GROUP BY clause or an aggregate function can be used in SELECT. After GROUP BY comes the optional HAVING clause. It’s a selection that takes the grouped tuples as input. At this point, aggregate functions are available so we can fix the broken query above:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price) AS total_paid
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
GROUP BY users.first_name, users.surname, tickets.event_name
HAVING SUM(tickets.price) > 50;

Yes, we need to duplicate the SUM function. This is because, as we saw above, total_paid is not available yet and so we can’t reference it. In practice, there’s actually a good chance that total_paid would work because many database systems allow the use of aliases (defined with AS) here as a convenience. They’re smart enough to know that you’re referring to something in SELECT and go get it from there. Don’t rely on this behaviour. It’s non-standard and may not work on other database systems.

Next comes the SELECT clause. It specifies which attributes we want to project into the temporary results relation. Here we’re projecting the user’s first name, surname, event name and computing the total paid by that user for the event. We alias the sum using AS to give it a name. Once the new attributes have been projected with SELECT, the tuples can be ordered with ORDER BY and the number of results limited with LIMIT.
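For example, to see only the three biggest spenders we might finish the query like this (a sketch; note that ORDER BY can use the total_paid alias precisely because it is processed after SELECT):

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price) AS total_paid
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
GROUP BY users.first_name, users.surname, tickets.event_name
ORDER BY total_paid DESC
LIMIT 3;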

I often find it easier to write a SQL query in the logical order. I start with a placeholder SELECT * (i.e. project everything), then work out which relations I want, then how they’re joined, then how to filter them using WHERE and finally, once I’m confident that I’m getting the correct results, I go back and specify which columns I want in SELECT. Returning all of the columns often shows erroneous results that would have gone unnoticed if I had already cut out most of the columns.

Database architecture

Now that we understand how a query is logically executed, let’s look at how the database system actually works. From previous chapters it should be obvious that computer scientists and programmers simply love layering things on top of each other. Out of bare transistors we create the abstraction of logical gates. From these we make components. These in turn are arranged to create processors and memory. A processor exposes an instruction set architecture that forms an abstraction between the user and the hardware. On top of all this we build the operating system, which adds yet more abstractions between the hardware and user programs. A database system is a user program. Amusingly enough, the database system is itself structured as another layer of abstractions.

A typical database architecture

At a high level, a database system consists of a front end and a back end. The front end takes a computer program specified as a SQL query, parses it, analyses it and generates a plan to execute it correctly and efficiently. The back end is responsible for executing the plan and ensuring that data is efficiently and durably stored. It contains the sub-systems that interact with the underlying operating system. The terms “front end” and “back end” are analogous to their usage in web development, where the front end converts user actions into HTTP requests and the back end executes those requests. You’ll also see in the chapter on compilers that database systems have some interesting parallels with how compilers convert source code into machine instructions. It’s useful to think of database systems as programs that run other programs in the form of SQL queries against data. They do this while always maintaining the ACID properties. This is a very cool engineering achievement.

In the beginning, a client makes a connection to the database system. The mechanism varies. Postgres runs a server process listening on TCP port 5432 by default, so we are back in networking territory. SQLite is file-based and so opening a connection simply means opening the database file.

Different database systems handle connections differently. Some create a new worker process for each connection. The worker remains assigned to the connection for as long as it is open and handles every query from that connection. This is how Postgres behaves. An alternative approach is to maintain a pool of workers. An incoming query is assigned to the next available worker. Once the worker finishes the query, it is sent back to the pool where it waits to service the next query. In this model, one client might have each query executed by a different worker. Either way, the maximum number of available workers sets a limit on how many users can access the database concurrently.

We begin in the query compiler. This is part of the front end and is itself made up of a parser, planner and optimiser. It takes a string of SQL and outputs an execution plan. The first step is to parse the input string to work out what the hell it means. The worker analyses the string and generates a data structure known as a parse tree. We’ll look at parsing in much more detail when we look at compilers, but for now it’s enough to understand that the parser turns this string:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price)
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
GROUP BY users.first_name, users.surname, tickets.event_name
HAVING SUM(tickets.price) > 50;

Into a parse tree like this:

Example parse tree

The query planner takes the parse tree and works out which relations and operations on those relations are required. It generates a logical query plan:

Example query plan

The query plan for our query matches the logical order of SQL execution. It begins from the bottom at the leaf nodes, which represent the relations, and works upwards. Each node is a relational algebraic operator. Output flows up to parent nodes and eventually the final result is at the root.

Since SQL is declarative, our query doesn’t give the system any instructions about how to execute the query. The next step is to convert the logical query plan into a physical query plan. It superficially resembles the logical one but contains lower-level information that the system can use to actually retrieve data and perform operations. You can view the plan generated for a query by prefixing the query with EXPLAIN.

Even for very simple queries there will be many ways to physically compute the results. Each different way of putting together the data is called an access path. For example, if the system needs to scan through an entire table to find a particular tuple, how many tuples will it have to scan? Is there a valid index (described in more detail below) that might make things faster? If there are multiple indexes available, which is best? If multiple tables need to be joined together, in what order should they be joined and using which algorithm?

The query optimiser estimates the cost of each access path and generates an efficient plan as quickly as it can. For non-trivial queries just generating and assessing the possible access paths is a challenging task in itself. As an example, the number of join orderings increases factorially with the number of join relations. Database systems make the problem more tractable by only attempting to find a reasonably efficient plan reasonably quickly.

To execute the plan we move to the back end. The execution manager is responsible for carrying out the required operations. SQLite has an interesting implementation where the query plan contains instructions that are executed by a virtual machine. The idea is to abstract away from the details of the underlying hardware and file system.

The leaf nodes of the physical query plan do not contain data directly. Instead they tell the execution manager where it can retrieve the data. A separate storage layer is responsible for actually persisting the data in an efficient and safe way. The execution manager requests pages of data from the buffer manager. The buffer manager interacts with the OS to read and write data and usually maintains a cache of recently used pages to aid performance. The buffer manager is also responsible for preserving durability by verifying that writes to permanent storage are actually carried out.

The transaction manager is a component that wraps the query execution steps in transactions to preserve atomicity. Important points in the transaction are logged to persistent storage so that consistency and durability are preserved, including recovery in case of system failure. The transaction manager also implements concurrency control by scheduling transactions to preserve isolation.

B-trees

Database systems have to quickly search through very large amounts of data. The choice of data structure is therefore very important. A popular choice for database system implementations is the B-tree. It’s a specialised form of the tree abstract data type that we first saw in the data structures chapter. The “B” in B-tree does not stand for “binary”. It stands for “balanced” (or possibly the creator’s name – it’s a bit unclear).

Recall that in a binary tree each node consists of a value and two children. When searching for a value, you compare it against the current node’s value. If the searched-for value is greater, you move to the right child; if less, you move to the left. Knowing which subtree to pursue makes lookup much faster because the ordering tells us which branch the value must be in. Database systems routinely store millions of rows in tables. At this scale the cost of moving through all of those levels becomes substantial. Unless the tree has been explicitly written to be cache efficient, it is likely that each node will reside in a different page on disk. Each step down a layer of the tree therefore requires reading a new page from disk and we know how slow that can be. Furthermore, the binary tree is pretty space inefficient. Each node needs space for its value and a pointer to each of its children, so if pointers and values are roughly the same size, only around a third of the space holds actual data.

A B-tree is a generalisation of a binary tree. The data is stored in the leaf nodes, but rather than storing just a single value at each node we store many, potentially hundreds. (Strictly speaking, the variant described here, in which all values live in the leaves, is usually called a B+-tree, but database literature tends to just say B-tree.) The values are stored in the leaf in sorted order, allowing for a fast binary search within the leaf. Internal nodes consist of pointers to nodes sorted by their key – the highest value in the child node. To find a value, you traverse the root node’s pointers until you find one with a key greater than the searched-for value. Follow the pointer and repeat until you hit the leaf node. Scan or search through the node to find the value. A diagram might make things clearer:

A B-tree

The leaf nodes hold pointers to the previous and next leaf node so that they also form a doubly-linked list. Traversing the B-tree will always put you in the leaf node with the first instance of the searched-for value. You can then search for other matches by iterating through the leaf nodes in order.

The strength of the B-tree is that it very rarely needs to have more than a few layers. Finding a value requires correspondingly fewer page reads. The key factor is the number of entries in each node. With fifty entries per node, increasing the depth of the tree by just one means that fifty times more entries can be stored. The cost of keeping the B-tree balanced after modifications is insignificant compared to the increased lookup performance.

Indexes

Database systems store rows without any inherent ordering. To find a particular row, it’s therefore necessary to scan sequentially through the whole table. When I first started using databases, I found this kind of weird. It seemed like a really inefficient way of doing things. Even though database systems are very fast, it is still a slow way of finding a value. If all you have is an unordered set of rows, there’s not much else you can do, just as you can’t use a binary search on an unordered input. It doesn’t make sense to store the data in sorted order because then every INSERT or DELETE modification to the database would require shuffling huge numbers of rows around to preserve the ordering.

If we can’t maintain an ordering in the data itself, can we maintain an ordering somewhere else? Let’s look at an analogue example. The words in a textbook are alphabetically unordered. If you look through a textbook for every instance of a given word, you have no reason to assume that it will appear on one page and not on another. You need to search all of them. Unless, that is, the publishers have seen fit to add an index: a sorted list of words and the pages they occur on. Being sorted means that it’s easy to find a particular index entry and the page numbers point us directly to where we want to go.

A database index works in just the same way. It’s a sorted mapping from a value to the locations of every row in which the value occurs. Database systems use different implementations but a common one is a B-tree in which the leaf nodes hold the indexed values and the locations of each matching row. The index B-tree is entirely separate from the data B-tree. It will be much smaller than the data B-tree but it still imposes a space cost on the database. The time benefit is that the index B-tree will probably be small enough to keep in memory. The only disk accesses required are to retrieve the data pages specified by the index. Keeping the smaller index B-tree sorted is much easier than keeping the entire table sorted.

Using indexes is probably the simplest and most effective database optimisation you can perform. An index is created for a particular set of columns, possibly including a condition. If you know that you’re likely to query a particular column a lot, you can simply add an index when creating the table. Otherwise, you can use EXPLAIN to uncover queries that might benefit from an index. By itself, EXPLAIN shows only an estimated cost for a query. In Postgres, EXPLAIN ANALYZE actually runs the query and shows the true cost.
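For instance, an index can cover several columns and, in systems such as Postgres and SQLite, include a condition to form a partial index. The index names below are invented for illustration:

-- Speeds up lookups by purchaser, and by purchaser and event together.
CREATE INDEX tickets_purchaser_event ON tickets (purchaser_id, event_name);

-- A partial index covering only the expensive tickets.
CREATE INDEX tickets_expensive ON tickets (price) WHERE price > 50;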

Let’s modify our earlier query to limit our results to tickets bought by someone with a particular name:

SELECT
  users.first_name,
  users.surname,
  tickets.event_name,
  SUM(tickets.price)
FROM users
LEFT JOIN tickets ON tickets.purchaser_id = users.id
WHERE users.first_name = 'Tim'
GROUP BY users.first_name, users.surname, tickets.event_name
HAVING SUM(tickets.price) > 50;

Here’s what Postgres estimates for our query (using EXPLAIN) when performed on tables containing millions of users and tickets:

 1 GroupAggregate  (cost=14543.59..14543.62 rows=1 width=47)
 2   Group Key: users.first_name, users.surname, tickets.event_name
 3   Filter: (sum(tickets.price) > 50)
 4   ->  Sort  (cost=14543.59..14543.60 rows=1 width=43)
 5      Sort Key: users.surname, tickets.event_name
 6      ->  Nested Loop Left Join  (cost=1000.00..14543.58 rows=1 width=43)
 7        Join Filter: (tickets.purchaser_id = users.id)
 8        ->  Gather  (cost=1000.00..14541.43 rows=1 width=35)
 9          Workers Planned: 2
10          ->  Parallel Seq Scan on users  (cost=0.00..13541.33 rows=1 width=35)
11            Filter: (first_name = 'Tim'::text)
12        ->  Seq Scan on tickets  (cost=0.00..1.51 rows=51 width=16)

The output is a little easier to read when you think back to the query plan diagram. Starting from the bottom, Postgres performs a sequential scan on users and tickets, joins them, sorts by surname and event name and then groups and filters. The important thing is that without an index on the users table, Postgres has no alternative but to sequentially scan every tuple to find a match. The output of EXPLAIN ANALYZE is too verbose to reproduce here, but it shows that the execution time is about 63ms. Since the sequential scan has such a high cost, let’s see if things are improved by adding an index:

CREATE INDEX first_name ON users (first_name);

When we analyse the query again, the query planner detects that a valid index is available. It estimates the cost of a query plan using the index and, finding that it’s much faster, selects it:

 1 GroupAggregate  (cost=10.11..10.14 rows=1 width=47)
 2   Group Key: users.first_name, users.surname, tickets.event_name
 3   Filter: (sum(tickets.price) > 50)
 4   ->  Sort  (cost=10.11..10.11 rows=1 width=43)
 5     Sort Key: users.surname, tickets.event_name
 6     ->  Hash Right Join  (cost=8.46..10.10 rows=1 width=43)
 7       Hash Cond: (tickets.purchaser_id = users.id)
 8       ->  Seq Scan on tickets  (cost=0.00..1.51 rows=51 width=16)
 9       ->  Hash  (cost=8.44..8.44 rows=1 width=35)
10         ->  Index Scan using first_name on users  (cost=0.42..8.44 rows=1 width=35)
11           Index Cond: (first_name = 'Tim'::text)

The query optimiser notices that there is a suitable index and chooses it (line 10), greatly reducing the time taken to scan users. The overall execution time drops from 63ms to 0.1ms! This huge performance boost is why indexes are so important. Always think about what indexes you should have. If you’re not sure, analyse your queries to inform your thinking.

Concurrency control in SQLite and Postgres

The simplest way to write a database system is to only allow one user at a time. In many situations this is perfectly adequate, but it clearly won’t scale very well. Imagine that you have a website running as a standard Rails app backed by a database, where each visitor is handled by a different process with its own database connection. If the database only allows one user at a time, then only one visitor at a time can be served by your website. Clearly inadequate! What we want to do is handle multiple, concurrent users.

Concurrency is an amazingly interesting topic that we cover thoroughly separately. Concurrency is challenging for database systems because they hold state. We can’t just give each user their own copy of the database because then we’d have conflicting versions, to say nothing of the cost of duplicating the data. Database systems must implement very sophisticated concurrency systems to allow multiple users to operate on the database without causing conflicts or losing data.

The isolation property requires that partial changes are invisible to other concurrent users. Database systems achieve this via transactions. Every read/write operation on the database is wrapped in a transaction. It begins before the first change is made and is committed, or finished, when the system verifies that all modifications have been written to permanent storage. From within a transaction, a user can only see the state of the database at the point when the transaction began. Once it commits, a transaction’s changes become visible to subsequent transactions. This preserves isolation between users.

The gold standard of concurrent usage is known as serializability. A set of concurrent events are serializable if there exists some sequential ordering that achieves the same result. Let’s now compare how SQLite and Postgres handle concurrency and durability.

SQLite uses locking to implement concurrency control. A lock is a straightforward way of claiming access to a resource. SQLite’s default persistent transaction log is called the rollback journal. It’s a form of undo logging. Before modifying anything, the original value is written to the rollback journal. If the system needs to recover from a failure, it can roll back any uncommitted transactions by writing the original values back into the database.

Before reading from a database file, the user’s process tries to acquire a shared read lock. As the names suggest, multiple users can have a shared read lock on the same resource. When a user wants to modify a resource, it must first acquire a modify lock. This declares the intention to modify the resource. Any shared read locks are allowed to continue but no new ones are permitted, guaranteeing that the modifying process will eventually have exclusive access. The modifying process now creates a separate rollback journal file containing the original version of the pages that will be modified. SQLite requests that the OS flush the file to disk to avoid any caching layers and write straight to persistent storage. Once the journal file is in place, the modifying process can make its changes to the database pages. At first, the changes will only exist in the process’s address space and so other reading processes will still see the original data in the database file. When the modifying process is ready to commit, it claims an exclusive write lock, preventing anything else from accessing the resource. All changes are written to the database file and flushed to disk. The transaction is committed by deleting the journal file. Finally, any locks are released.

Rollbacks and crash recovery are very straightforward. When SQLite starts, it checks for any journal files. The presence of a journal file indicates that a transaction started but did not commit. The pages in the database file contain potentially invalid data. The system performs a rollback simply by writing the pages from the journal file back into the database file, undoing any modifications performed by the transaction.

Locking is a pessimistic form of concurrency control. It assumes that conflicting operations happen frequently and so takes preventative action to stop them from ever happening. If conflicts happen less frequently than anticipated, we’re wasting lots of time obtaining all these locks that aren’t often needed. Optimistic concurrency control allows operations to proceed as normal on the assumption that there will rarely be conflicts. Only when a conflict actually occurs does it take corrective action.

Postgres has an optimistic form of concurrency control known as multi-version concurrency control (MVCC). It works by maintaining different versions of the same resource. It uses complex rules to determine which version to show a particular transaction to maintain transaction isolation. To give you just a rough idea of how things work, each transaction is assigned a transaction ID in ascending order. Each version of a resource is tagged with the ID of the transactions that created and last modified it. Each transaction works with a “snapshot” of the database’s state at the time it began. A transaction cannot see versions of a resource made by transactions with a higher ID. From the transaction’s perspective, these are changes that happen in the future because the transaction with a higher ID must have started later.
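Postgres lets each transaction choose its isolation level; the default is the weaker read committed, and the strictest, serializable, can be requested explicitly:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- ... reads and writes ...
-- If Postgres detects that this transaction cannot be serialised with
-- concurrent ones, it is aborted with a serialisation error and the
-- application must retry it.
COMMIT;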

Postgres uses write-ahead logging (WAL), which is a form of redo logging. Rather than store the old data, as with undo logging, the new versions are written to the log. Transactions are committed when a special commit record is written to a separate commit log. Committed changes are periodically written into the database and removed from the WAL at checkpoints. If the system needs to recover from a failure, it works its way through the WAL, redoing all of the changes made by committed transactions not yet written into the database. All committed transactions are restored and uncommitted ones ignored. The main benefit of WAL is that writes happen outside of the main database so reads and writes can proceed concurrently. SQLite also offers write-ahead logging as an option.
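Enabling write-ahead logging in SQLite is a single pragma:

-- Switches the database to WAL mode; the setting persists across connections.
PRAGMA journal_mode=WAL;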

Conclusion

Databases are incredibly sophisticated pieces of software. They store data while preserving atomicity, consistency, isolation and durability (ACID). The relational data model, based on relational algebra, is very popular. It represents data as relations and queries as operations on relations. A database system works like a mini computer, taking a declarative SQL statement and generating an efficient execution plan. A typical database system is made up of many sub-components. The transaction, a sequence of operations that must be performed atomically, is the key abstraction that helps the database system to preserve the ACID properties. Transactions take the database from one consistent state to another. Concurrent transactions do not see each other’s partial changes. Durability is achieved by writing logs to persistent storage so that the correct database state can be recovered in the event of system failure.

Further reading

It’s difficult to find really good resources for databases. All of the textbooks I’ve read are really long and I don’t think they’re a good starting point. Designing Data-Intensive Applications by Kleppmann focuses more on distributed data systems but it does include well-explained sections on data querying (including SQL), storage and retrieval. I’d recommend it as the best overview of modern, web-scale data systems.

Stanford’s Jennifer Widom offered a course on databases that was one of the first MOOCs. It’s now been expanded into a complete program of introductory database courses. I recommend doing at least the “data models”, “querying relational databases” and “database design” modules. If you’re still interested you can do the advanced topics.

Architecture of a Database System is a short and readable overview of – you guessed it! – database system architectures. My firm belief is that having a good understanding of a database system’s internals helps you to utilise it more effectively.

SQL Performance Explained by Markus Winand is chock full of advice on writing performant SQL. While it’s true that you don’t need to worry about performance right away, basic things like indexes are table stakes optimisations that even junior developers should be comfortable using. Much of the basic advice in SQL Performance Explained is available for free on the sister website Use the Index, Luke. You can see the main thrust of the advice!

Out of traditional database textbooks, Database Systems: the Complete Book by Garcia-Molina, Ullman and Widom was my favourite but I would only recommend it for those who are very interested in databases.