Distributed systems

Introduction

A distributed system is a collection of computers that present themselves to the user as a single, coherent system. The components of the system are spread over multiple computers that communicate over a network. The benefit of distributed systems is that they allow us to achieve levels of reliability and performance that are unobtainable on a single machine. The problem is that it is very difficult to design and run distributed systems correctly. They introduce whole new classes of sometimes very subtle bugs that can make it hard to reason about how the system operates. Leslie Lamport, a computer scientist renowned for his work in distributed computing, wryly describes them thus: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

By their very nature, distributed systems force us to think about the physical limitations of time and space. Latency is the time it takes for a result to become apparent to an observer. When you click a hyperlink, it takes some time for your request to reach the server. During that period of latency, the request has been initiated but it is not yet visible to the server. Communication, even at the speed of light, can never be completely instantaneous and so some latency is inevitable. We know from the architecture chapter that latency within a single machine needs to be handled carefully but for the most part we can leave it to the computer architects to worry about. In a distributed system, spread over multiple machines possibly at great physical distances from each other, the effects of latency are too obvious to ignore. Much of the challenge of distributed systems comes from understanding and managing the effects of latency. As we’ll see, the more a distributed system tries to behave like a single-node system, the greater the latency cost.

Distributed systems are increasingly common and some familiarity with them is essential for web developers. Simply put, all web work involves distributed systems to some degree. A web application running in the client’s browser and communicating with a back end service forms a distributed system. Front end developers need to understand the problems latency introduces and how to mitigate them. The back end service itself is probably made up of at least an application server and a separate database server. Even that simple design is a distributed system. A production-ready architecture will probably have multiple services, load balancers, caches and who knows what else.

In this chapter we’ll first review why distributed systems are so useful. We’ll see what kinds of problems they can solve and the new challenges they bring. We’ll then build a simple conceptual model of a distributed system and use that to determine the theoretical limitations of distributed computing. We’ll then explore a few distributed system architectures via a discussion of consistent replication. Maintaining a consistent view of state across the whole system is essential for many use cases and so understanding the challenges of consistent replication is a good entry point to thinking about distributed systems.

Why we need distributed systems

Why do we now find ourselves surrounded by distributed systems? In large part it’s because we users have become so demanding. We used to be content with a PC sat in the corner by itself. Then we got the Internet and wanted to be able to talk to other computers. Fast forward a few more years and now we have multiple, mobile computing devices and we expect all of them to remain in sync all of the time. What we want to do online has become more demanding too. Back in the 1990s, web services could get away with having regularly scheduled downtime to perform updates. An unexpected outage would be irritating but not critical for many businesses. In our modern, globalised world, we expect online services to be always available at any time. We want to be able to search the entire Internet in seconds and download any content rapidly whenever we feel like it. The sheer scale of modern demands makes distributed systems essential. Below are three reasons why distributed systems help.

1. More machines can handle more

Tasks are easier when they can be handled by one machine. Sorting an input is a relatively straightforward job when the input fits in memory. Running a web server is easier if a single machine can handle all of the requests. When the problem becomes too big for a single machine, we need to change our approach.

Scalability is the ability of a system to cope with increased demand. We’ll take “demand” to have a fairly loose meaning of “the work to be done”. That might refer to a single, large job, such as indexing the web, but it could also mean a large number of small jobs such as requests to a web server. The first solution is to scale up: get bigger and better hardware so that one machine can handle more. This works very well but only up to a certain point. If you’re trying to index the entire web then no amount of scaling up will do. No single machine, no matter how beefy, can handle the task. The second solution is to scale out: add more machines to the system. Now you have multiple machines coordinating to solve a problem. Congratulations – you have a distributed system!

Distributed systems can handle huge workloads by dividing them into smaller tasks and spreading those tasks across multiple machines known as nodes. They can cope with increased demand by scaling out. This makes them more flexible and probably more cost effective. For example, online retailers predictably experience huge spikes in demand at certain points in the year. The spikes require a high server capacity to handle the load. It would be wasteful to provision the same capacity throughout the rest of the year when demand is predictably much lower. This is the problem with scaling up. A server powerful enough to handle the huge, albeit brief, spikes would spend most of its time working at a small fraction of its full capacity. That’s a waste of resources. If the retailer chose to scale out instead, they could provision only the capacity they needed and easily add extra machines when required.

2. More machines improve performance

We know from the concurrency chapter that some problems are parallelisable: they can be solved more quickly by having multiple machines solve parts of the problem in parallel. So for tasks that are easily parallelisable, some form of distributed system will deliver a performance boost.

A different performance benefit comes from globally distributed systems. Remember that communication is never instantaneous and transmission delay has a lower bound determined by the distance between the two nodes. By hosting the same data in multiple data centres dispersed around the globe, a system can serve the user’s requests from the physically nearest data centre. Shorter distances result in lower latency and faster response times. Big content providers such as Netflix do this to make their services appear faster and more responsive. Having the same data in multiple locations introduces the challenge of maintaining consistency across all of those data sources.

3. More machines increase resilience and fault tolerance

Global web services don’t have the luxury of regular maintenance downtime. They are in constant use. For many online businesses, a brief outage could cost millions in lost revenue. Even non-tech companies would be hobbled if their internal systems became unavailable.

We want our systems to be available for use and functioning correctly as much as possible. A fault is when one component of the system (e.g. a node’s hard drive) stops working correctly. Faults are sadly inevitable. Software contains bugs and hardware eventually breaks. What we can control is how we respond to faults. Resilience means that the system can recover from faults. The user may observe some sign of failure but the rest of the system continues to work. Resilience itself doesn’t require a distributed system. A simple example is a web application running on a single server. It might return an error response to one request but continue accepting and correctly processing other requests.

Distributed systems enter the scene when we aim for high availability. It’s an imprecisely defined term but simply means that the service tries to have very little downtime (i.e. periods of unavailability). Normally this involves having redundant components arranged in a distributed system so that there is no single point of failure. Fault tolerance is an even more stringent requirement. It means that the system can continue correct operation even if some components fail. The user doesn’t see any sign of failure, apart from maybe some increased latency. This obviously requires some form of distributed system, since a single node cannot provide fault tolerance.

The problem for designers of distributed systems is that more hardware means more complexity and more places for things to go wrong. Any given hardware failure is relatively unlikely on a single machine, but when you are running thousands of machines failures become almost inevitable. Stats from Google indicated that a new Google cluster might experience thousands of individual machine failures in a single year. Worse, entirely new classes of errors arise due to the interactions of so many components. Faults can cascade across the system in unexpected ways. We cannot achieve fault tolerance simply by throwing more hardware at the problem. True fault tolerance is difficult to achieve and requires a lot of careful planning.

A theoretical model

In this section we will define a basic, theoretical model of a distributed system. We’ll use it to explore the distinguishing characteristics and limitations of distributed systems in general. This is analogous to how we used automata to analyse computation way back in the theory of computation chapter. Our model takes the form of a set of assumptions about the system’s environment and components. The assumptions are deliberately very simple to make analysis easier.

We assume that a single program is executing concurrently on multiple, independent nodes. Each node has its own local, volatile memory and persistent storage. It also has its own internal clock. The program running on each node allows the user to read and modify a single, system-wide variable. This is known as a “register” after a processor register. “Independent” here means that each node can fail and reboot without affecting the others.

Nodes communicate by passing messages over a network. Messages may be delayed or lost. There is no upper bound on transmission delay. The network therefore introduces non-determinacy into the system because we cannot guarantee the order in which messages arrive. We can model the network by giving each node a first-in-first-out (FIFO) message queue. “Sending” a message to a node is equivalent to adding the message to the node’s message queue with no guarantee when or whether the message will be successfully added. The structure of the network is not important as long as every node can communicate with every other node.
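We can sketch this messaging model in a few lines of Python (the class and function names here are my own invention, purely for illustration):

```python
import collections
import random

class Node:
    """A node whose only way of receiving messages is a FIFO queue."""
    def __init__(self, name):
        self.name = name
        self.inbox = collections.deque()

    def deliver(self):
        """Process the oldest queued message, if any."""
        return self.inbox.popleft() if self.inbox else None

def send(sender, receiver, message, loss_rate=0.0):
    """Sending just appends to the receiver's queue, and may
    silently fail, modelling a lossy network."""
    if random.random() >= loss_rate:
        receiver.inbox.append((sender.name, message))

a, b = Node("A"), Node("B")
send(a, b, "w(x, 1)")
send(a, b, "r(x)")
print(b.deliver())  # ('A', 'w(x, 1)') -- FIFO order is preserved
```

Setting a non-zero loss_rate models the network silently dropping messages: crucially, the sender gets no indication either way.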

We need to consider what kind of faults can occur in this system. We have already assumed that the network may arbitrarily delay a message. We will also assume that a node fault means that the process crashes and the node does not respond to requests. This, surprisingly, is the easy assumption. Far more insidious are partial faults where a node fails in some way but still responds to requests. It might return erroneous data, delay for an inordinately long time or appear to have correctly accepted a request but actually failed silently. A fault model that allows for components to fail with imprecise information about their failure is known as a Byzantine fault model, after the Byzantine Generals thought experiment from the networking chapter. Recall that neither general knows whether their messages were successfully delivered. That’s an example of a partial fault. The Byzantine fault model most closely matches reality. Our assumption of a crashing fault is a simpler fault model. Keep in mind that real systems have to handle much more complex faults.

To recap, we have independent nodes running the same program communicating over a fallible network. A consequence of each node having its own local clock is that the system as a whole does not have a consistent view of the current time. Each node has its own, local perception of time. This has really significant implications. Clocks are important because by timestamping operations we can put them in order. In a system where each node has its own clock, timestamps are only comparable when they come from the same node. That means any ordering of operations must be local to a node. We can’t guarantee that any two nodes’ clocks are in sync and therefore we can’t compare timestamps between nodes. If one node’s clock is even a little bit out of sync with the others, that node will perceive a different ordering of events. We as observers can see the “correct” ordering but that is only because we implicitly assume a global clock that the system doesn’t have.

Clock skew between nodes

In the diagram above, node B has a clock skew (i.e. error) of three seconds. By comparing the relative positions of the events against the global clock on the right, we can see that the true order of events is: A1, B1, A2, B2, A3. But because of the skew, node B is writing incorrect timestamps. The order reported by the timestamps is: A1 and B1 concurrent, B2, A2, A3. Such discrepancies can easily lead to errors if an observing node applies the operations in an incorrect order.
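We can reproduce the scenario in the diagram with a little Python (the event times below are my reconstruction from the description above; only the relative ordering matters):

```python
SKEW = {"A": 0, "B": -3}  # node B's clock runs three seconds slow

# (node, event, true_time): true_time is the global clock that we,
# as outside observers, can see but the system itself cannot.
events = [("A", "A1", 0), ("B", "B1", 3), ("A", "A2", 4),
          ("B", "B2", 6), ("A", "A3", 8)]

# Each node stamps its own events using its skewed local clock.
stamped = [(true + SKEW[node], name) for node, name, true in events]

true_order = [name for _, name, _ in events]
reported = [name for _, name in sorted(stamped)]
print(true_order)  # ['A1', 'B1', 'A2', 'B2', 'A3']
print(reported)    # ['A1', 'B1', 'B2', 'A2', 'A3']
```

In the reported order, B1 carries the same timestamp as A1 (they appear concurrent) and B2 appears to precede A2, matching the discrepancy described above.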

So we can’t rely on each node’s system clock being accurate. Why can’t we use some kind of centralised clock server that provides the system with a global time? While time servers do exist, the speed of light means that each node will have a slightly different perception of that clock. Imagine that the global clock broadcasts to every node the message “it’s now 12:00pm exactly”. Since each node is at a slightly different distance from the time server, and network delays fluctuate randomly, each message will experience a slightly different amount of latency and so will be delivered to each node at a slightly different time. Any other non-clock ordering system we might come up with would still face the same latency problem. We must therefore assume that messaging is asynchronous and make no assumptions about timing.

A further detail of our model is that nodes have fast access only to the state in their local memory. While a node can request state from other nodes, it can never be sure that the state hasn’t been updated while the response is in transit. Any “global” state shared between nodes is potentially out of date. Knowledge is therefore always local.

Let’s look at our network. It is fallible and so messages might be delayed or fail to arrive. A network partition happens when a network error cuts off healthy nodes from the rest of the system. The network is partitioned into separate sections. Network partitions make it difficult to reason about error states. How does a node determine whether a message it’s sent has been lost by the network or merely delayed? How does the node distinguish between a node failure and a network failure? A common design technique is for nodes to send a “heartbeat” message to each other to check that they’re still available and reachable. If node A sends a heartbeat message to node B, for how long should it wait for a response before deciding that node B is unreachable? There is no obviously correct answer.
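A heartbeat check might look something like this sketch (the names and the timeout value are arbitrary choices of mine, and choosing the timeout is precisely the hard part):

```python
import time

def is_reachable(ping, timeout=1.0):
    """Send heartbeats via `ping` and wait up to `timeout` seconds.
    A False result is ambiguous: the peer may have crashed, the
    network may be partitioned, or the reply may simply be slow."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if ping():            # did a response arrive?
            return True
        time.sleep(0.01)      # poll again shortly
    return False              # "unreachable" -- but we can't know why

print(is_reachable(lambda: True))                  # True
print(is_reachable(lambda: False, timeout=0.05))   # False
```

Whatever value we pick for the timeout, a sufficiently slow network can exceed it, so a node that is merely slow is indistinguishable from one that has failed.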

Handling network partitions: the CAP theorem

A distinguishing characteristic of distributed systems is how they respond to network partitions. Eric Brewer in 2000 presented a simple idea: a distributed system cannot be simultaneously Consistent, Available and Partition tolerant:

The CAP theorem triumvirate

This idea was formalised into the CAP theorem in 2002 by Gilbert and Lynch, taking its name from the first letter of each property. The theorem assumes a system model very similar to our own, except it has the further simplifying assumption that nodes don’t fail. We can use the theorem to determine the limitations of our own model. The theorem proves that when partitions occur, the system cannot preserve both consistency and availability. It’s a useful model for thinking about how distributed systems behave when network problems occur but it does have some limitations, which I’ll deal with after we take the three sides of the triangle in turn.

Consistency, in an informal sense, means that the system agrees on the current state. In an ideally consistent system, an update to a value would be applied to all nodes instantaneously. The system would behave identically to a single-node system. We already know, however, that “instantaneous” is not possible in reality. It takes at least some time for the update to propagate to the other nodes. Therefore what we must settle for is consistency after some time period that doesn’t unduly affect users. A variety of consistency models can be created depending on which orderings of events you consider permissible. Very roughly speaking, the “strength” of a consistency model measures how close it gets to the ideal. In the further reading section, I’ve included a reference to a map of consistency models that contains sixteen different models (don’t worry, you don’t need to know them all!).

In the definition of the CAP theorem, consistency has a precise meaning: every read of a value must return the most recently written value (or an error). It’s not instantaneous but, effectively, when queried, all nodes will agree on the same state for a given value. This particularly strong form of consistency is known as linearizability and we’ll cover it in more detail below.

Availability is a little simpler. It means that a request to a live node eventually generates a non-error response. Informally, the system “works”, although possibly with delays. Note that there is no requirement that the response returns the latest value. It’s possible that the system will return stale (i.e. out of date) reads or even lose writes. We can see that availability does not require consistency.

Finally, partition tolerance. This one is poorly named. As we know, a partition occurs when a network failure breaks the network into separate sections with no communication across the partition. Messages from nodes in one partition never reach nodes in the other. Partition tolerance means that the system will continue to process requests even if a partition occurs. According to how we have modelled our system, partition tolerance means that the system will continue to function despite an arbitrary number of messages failing to appear in their destination node’s message queue.

Why do I say “poorly named”? Well, flakey networks are an unfortunate fact of life. A designer of a distributed system doesn’t have the luxury of deciding that partitions won’t occur. They have to decide how the system will respond when one does occur. This leads us to a common misunderstanding of the CAP theorem. You will often see the theorem summarised thus: “you can have any two of consistency, availability and partition tolerance. Take your pick!” This is incorrect and it’s obvious to see why if we take a simple example:

Node A cut off by a network partition

A partition has occurred and separated node A from nodes B and C. Nodes B and C can communicate with each other but not with A. Should A continue to serve requests? If it refuses to serve a request, then we have a live node returning errors and so we have given up availability. Should it respond to a read request, it has no way of knowing that B and C haven’t accepted a newer write. If the user sends a write to C and then reads from A, the system will return a stale value, thus breaking consistency. Even worse, if A also accepts writes, we could end up with two divergent “histories”: B and C think the current state is one value and A thinks it’s a different value. This is known as split brain and is very difficult to fix. It’s a bit like the worst merge conflict you can imagine, only spanning your entire data store. Horrific, news-worthy outages often involve split brain.

Partitioned nodes both accept writes, leading to divergent histories

The lesson from the CAP theorem is that when partitions inevitably occur, we must choose between availability and (linearizable) consistency. We can have consistency with partition tolerance or availability with partition tolerance but not consistency with availability and certainly not all three. The question for the system architect (i.e. you, dear reader) becomes: which of consistency and availability is more important?

Before answering that, we must discuss the limitations of the CAP theorem. A theorem proves that when certain preconditions hold, the theorem’s conclusions must be true. If we deviate from those preconditions, the theorem does not necessarily still apply. The conclusions only apply when using the exact same definitions and assumptions as the proof. The CAP theorem’s apparently binary choice between consistency and availability only holds when we are talking about linearizability because that is how the theorem defines consistency.

Similarly, the CAP theorem’s definition of availability only considers what happens when requests reach nodes within the system. A practical response to a network partition would be to re-route traffic away from the nodes in the minority partition to nodes in the healthy partition. From a user’s perspective, the system would continue to work correctly. Nevertheless, the CAP theorem would not consider this system to be available because its definition of availability only considers what happens when a request actually reaches a node. In practice, weaker consistency models permit more availability because there are fewer constraints on the responses they can provide.

What the CAP theorem is telling us is that stronger consistency implies less availability and weaker consistency permits more availability. The question is not whether we want consistency or availability, but which do we want more of.

If the distributed system holds business-critical data, we probably value consistency over availability. It would be very troubling if an invoice appeared to be paid one moment and unpaid the next. In that case, the system should preserve consistency by refusing to respond to requests if there’s a danger of returning stale values or losing data. We’ll see various implementations of that idea shortly. Sometimes, by contrast, availability is more important. A Twitter user probably doesn’t mind if their timeline doesn’t show absolutely up to date results so long as it shows something. When shopping online, it’s preferable to have a deleted item occasionally reappear in your shopping cart than have the website completely crash. Such systems will allow stale reads and maybe even lost writes as the price of greater availability.

To summarise, what the CAP theorem demonstrates is that no distributed system, even one as simple as our abstract model, can guarantee the same behaviour as a single-node system. Some deviation is an inevitable consequence of non-instantaneous communication. Availability and consistency are competing priorities and the correct balance depends on the intended use case. It’s important not to attach too much significance to the CAP theorem. While an important result, its applicability is limited to the model in its definitions and it’s not necessarily a good tool for thinking about real world systems.

Consistency models

In this section we’ll analyse a few common consistency models and their implications for availability. Consistent data replication is a central problem in distributed computing. It’s obviously critically important for distributed databases, one of the most common examples of distributed systems, and other systems will most likely have some kind of internal data store that needs to be replicated consistently. For example, Kubernetes, a tool for deploying applications on computer clusters, uses a distributed store called “etcd” to hold the cluster state.

I must admit that the terminology in this discussion can be a bit confusing and the differences between consistency models are quite subtle. We’ll start with the strongest form of consistency and check out a few interesting models on our way down the strength scale. Generally speaking, the stronger the consistency model, the easier it is to understand but the harder it is to implement. Stronger models also have greater latency costs and are less available.

As an aside, “consistency” as used in the CAP theorem is totally different to the “consistency” in a database system’s ACID guarantees. ACID consistency only ensures that each transaction brings the database from one valid state to another. The “isolation” guarantee is actually closer to what the CAP theorem means by consistency. The strongest isolation level is “serializability”, which ensures that concurrent transactions always leave the database in the same state as if the transactions had been performed in some sequential order. This is a weaker guarantee than linearizability. The terminological confusion arises because distributed systems and databases were originally developed and studied separately. The database folks were concerned with transactions, groups of operations modifying multiple objects, while the distributed system people were talking about single operations on single objects distributed across multiple locations.

Let’s take our abstract model from the previous section and run a distributed key-value store on it. The two valid operations are read(key) and write(key, value), or r(k) and w(k,v) for short. The initial value for each key is 0. Operations belonging to the same node must occur in order according to the node’s clock. We assume that, due to communication delays, there is some period of time between an operation’s invocation and response. Operations that occur during the same time period are said to be concurrent and have no ordering between them. That opens the door to non-determinacy because concurrent operations may be interleaved in different orders, leading to different responses and final states. The system’s consistency model determines which interleavings are permitted. A stronger system allows fewer interleavings while a weaker model is more permissive.
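The notion of concurrent operations can be captured precisely. In this sketch (names are my own), each operation is an interval from invocation to response, and two operations are concurrent when their intervals overlap:

```python
from dataclasses import dataclass

@dataclass
class Op:
    """An operation with duration: invoked at `start`, with the
    response arriving at `end`."""
    name: str
    start: float
    end: float

def concurrent(a, b):
    """Two operations are concurrent if their durations overlap:
    neither responds before the other is invoked."""
    return a.start < b.end and b.start < a.end

w = Op("w(x, 1)", start=0.0, end=3.0)
r1 = Op("r(x)", start=1.0, end=2.0)  # overlaps the write
r2 = Op("r(x)", start=4.0, end=5.0)  # invoked after the write responds
print(concurrent(w, r1))  # True -- no ordering between them
print(concurrent(w, r2))  # False -- r2 must see the write's effects
```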

Should the read return the old or the new value?

Linearizability

In simple terms, a linearizable data store appears to behave like a single-node data store. Nodes always agree on the current value. That’s very attractive because programmers are already familiar with how single-node systems work. We assume that each operation executes atomically at some point between invocation and response. Once the operation executes, it “takes effect” and every operation executing after that point must be able to see the effects. So once w(a, 10) executes, a holds the value 10 and all subsequent r(a) operations must return 10 (or a later value). Because later operations always see the latest value, every node agrees on the current state. Linearizability still allows for some non-determinacy. When two operations are concurrent, there is no ordering between them. A read that is concurrent with a write can return either the old or the new value, provided that the new value hasn’t yet been read.

In the diagram above, each operation is invoked at its leftmost point and responds at the rightmost. Time goes to the right. The vertical lines mark the moment in which each operation is atomically executed. You can draw a line from execution point to execution point. The system is linearizable if the line always goes from left to right (i.e. never backwards in time) and never reads stale values. The execution points could occur anywhere within the operation’s duration – this is just one possible arrangement. Clients A and C both have reads concurrent with B’s write. Since they’re concurrent, both reads could have returned either the old or the new value. But once client C returns the new value, client A must also return the new value to prevent a stale read.
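For very small histories we can test linearizability by brute force. The sketch below (entirely illustrative; practical checkers use far smarter algorithms) tries every sequential ordering of the operations, keeps only those that respect real time, and asks whether any obeys register semantics:

```python
from itertools import permutations

def respects_time(order):
    """An op that had already responded can't be placed after an
    op that had not yet been invoked."""
    return all(not (order[j][2] < order[i][1])
               for i in range(len(order))
               for j in range(i + 1, len(order)))

def legal(order):
    """Register semantics: every read returns the latest write."""
    value = 0                               # keys start at 0
    for name, start, end, kind, v in order:
        if kind == "w":
            value = v
        elif v != value:
            return False
    return True

def linearizable(history):
    """Brute force: does any sequential order of the ops respect
    both real time and register semantics?"""
    return any(respects_time(o) and legal(o)
               for o in permutations(history))

# B's write overlaps both reads. C returns the new value but A,
# responding later, returns the stale one: not linearizable.
history = [("B: w(x,1)", 0, 4, "w", 1),
           ("C: r(x)",   1, 2, "r", 1),
           ("A: r(x)",   3, 5, "r", 0)]
print(linearizable(history))  # False

history[2] = ("A: r(x)", 3, 5, "r", 1)  # A sees the new value
print(linearizable(history))  # True
```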

An assumption in the formal definition of linearizability is that there is a global clock ordering every operation. When I breezily say above that “time goes to the right”, I’m implying that each operation in the diagram is positioned relative to this global clock. Now, we already know that a global clock is impossible and so in reality nodes can’t be sure exactly when operations execute on other nodes. What is important is that the system behaves as if the operations had been ordered by a global clock. There are a number of ways to implement this in the real world.

Google have come up with an interesting approach in their distributed database, Spanner. Rather than using timestamps to record a possibly inaccurate moment in time, Spanner uses an API called TrueTime to produce time ranges during which the operation definitely happened. To ensure that operation B happens after operation A, the system simply waits until the end of operation A’s time range has passed. Spanner received a lot of publicity because Google announced that TrueTime uses machines with fancy atomic clocks and GPS antennae to keep the interval of uncertainty as short as possible. Such specialised hardware is not required in principle. CockroachDB is an alternative implementation of the same idea that runs on commodity hardware. Google claims that Spanner offers “external consistency”, which is linearizability with additional semantics for transactions on multiple objects.
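The commit-wait idea behind TrueTime can be sketched as follows. Note that tt_now is a stand-in of my own for the real TrueTime API, and the uncertainty bound is an assumed figure:

```python
import time

UNCERTAINTY = 0.007  # assume clocks are synced to within ~7ms

def tt_now():
    """A stand-in for a TrueTime-style API: an interval
    [earliest, latest] that definitely contains the true time."""
    t = time.time()
    return t - UNCERTAINTY, t + UNCERTAINTY

def commit(apply_write):
    """Commit-wait: stamp the write with the latest possible time,
    then wait out the uncertainty before acknowledging. Any
    operation starting afterwards is guaranteed a later stamp."""
    _, stamp = tt_now()
    apply_write()
    while time.time() < stamp:   # wait until the stamp has passed
        time.sleep(0.001)
    return stamp

t1 = commit(lambda: None)
t2 = commit(lambda: None)
print(t2 > t1)  # True: commit order matches real-time order
```

The shorter the uncertainty interval, the shorter the wait, which is why Spanner invests in accurate clocks.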

The only reason we care about global clocks in the first place is that they give us an easy way to put operations in order. A second approach is to find other ways to create a system-wide ordering of events. We’ll see below how consensus protocols such as Raft achieve this.

Sequential consistency

If we remove the requirement that operations are ordered according to a global clock, we have sequential consistency. The order of operations must respect each node’s internal ordering but there is no required ordering between nodes. Once a node has observed the effects of an operation, it must not return stale state from before the operation. Informally, sequential consistency allows for an operation’s effects to take place after the operation completes (or even before its invocation, in theory).

The example above is not linearizable but it is sequentially consistent. Think of all the different ways you might interleave the operations while ensuring that each node sees its own operations occur in order. One consistent ordering would be:

B: w(x, 2)
C: r(x) 2
D: r(x) 2
A: w(x, 1)
C: r(x) 1
D: r(x) 1

Although A: w(x, 1) was the first operation according to the global clock, in this ordering its effects didn’t actually take place until after the first two reads had occurred. I find it helpful to visualise sequential consistency as an asynchronous queue that preserves logical order. Imagine a photo-sharing application popular with influencers trying to hide the emptiness of their existence. When an influencer uploads some photos of their lunch, the uploaded photos go into an asynchronous work queue for resizing and compression. The node that handles the uploading will see its own operations in order and so it will insert the uploaded photos into the processing queue in the same order as they were uploaded. It will take some unknown time for the effects of each successful upload operation to become visible on the influencer’s timeline, but the photos will appear in the upload order.
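We can verify mechanically that the ordering listed above is sequentially consistent: it preserves each node’s program order and every read returns the latest write (a small sketch of my own; the operation strings follow the listing above):

```python
# The proposed global order from above, as (node, op) pairs.
order = [("B", "w(x, 2)"), ("C", "r(x) 2"), ("D", "r(x) 2"),
         ("A", "w(x, 1)"), ("C", "r(x) 1"), ("D", "r(x) 1")]

# Each node's own program order: the ops it issued, in sequence.
program = {"A": ["w(x, 1)"], "B": ["w(x, 2)"],
           "C": ["r(x) 2", "r(x) 1"], "D": ["r(x) 2", "r(x) 1"]}

def sequentially_consistent(order, program):
    # 1. The global order must contain each node's ops in its order.
    for node, ops in program.items():
        seen = [op for n, op in order if n == node]
        if seen != ops:
            return False
    # 2. Register semantics: each read returns the latest write.
    value = 0
    for _, op in order:
        if op.startswith("w"):
            value = int(op.split(",")[1].strip(" )"))
        elif int(op.split()[-1]) != value:
            return False
    return True

print(sequentially_consistent(order, program))  # True
```

Note that no global clock appears anywhere in the check: only per-node ordering and register semantics are enforced.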

Linearizability and sometimes sequential consistency are normally what is meant by “strong” consistency. Sequential consistency is linearizability without the real time ordering. A strong consistency model will only show consistent state. A value will not flip flop between old and new versions. The appeal of strong consistency is that the system exhibits behaviour that is familiar and intuitive to programmers. The downside is that this consistency requires more coordination. Nodes need to spend a lot of time communicating to keep each other informed. That network chatter entails increased latency and a vulnerability to partitions.

Eventual consistency

A much weaker consistency model is eventual consistency. This only guarantees that, in the absence of further writes, every read will eventually return the value of the latest write. In the meantime, reads may return stale values. There is no illusion of a single data store. The system exposes its progression towards the eventual state.

Examples of eventually consistent systems include Amazon DynamoDB, DNS and Kubernetes. Clients should not expect to receive the latest data and it’s possible that reads will return fluctuating values. Compared to a traditional, single-node data store, this is very unintuitive. On the plus side, nodes in the system need to spend less time coordinating. A node receiving a write operation can perform it, immediately respond to the client and then asynchronously propagate the update to other replicas. That results in lower perceived latency for the user. However, they may see old state if they query a node that hasn’t yet received the latest updates. The system will also be more available in the event of partitions. Nodes can simply continue serving stale reads and catch up once the partition is fixed. Eventual consistency is appropriate when speed and availability are more important than exactly correct values.
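A minimal sketch of this write path, with illustrative names, might look like the following: a write is accepted by one replica, acknowledged immediately and propagated to the others only later:

```python
class Replica:
    def __init__(self):
        self.value = None

class EventualStore:
    """A write lands on one replica and is propagated lazily."""
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.backlog = []  # (replica_index, value) updates not yet delivered

    def write(self, value):
        # Accept on replica 0 and acknowledge the client immediately.
        self.replicas[0].value = value
        self.backlog += [(i, value) for i in range(1, len(self.replicas))]

    def read(self, i):
        return self.replicas[i].value  # may be stale

    def anti_entropy(self):
        # Some time later, deliver the queued updates to the other replicas.
        for i, value in self.backlog:
            self.replicas[i].value = value
        self.backlog.clear()

store = EventualStore(3)
store.write("v2")
print(store.read(2))  # None — stale read before propagation
store.anti_entropy()
print(store.read(2))  # v2 — the replicas have converged
```

The gap between `write` and `anti_entropy` is exactly the window in which clients querying different replicas see different values.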

Eventually consistent distributed databases are often said to offer BASE semantics instead of the traditional ACID guarantees (don’t you love chemistry puns?). What this means is:

Basically Available: the system remains available for reads and writes the vast majority of the time
Soft state: the system’s state may change over time even without new input, as updates propagate between replicas
Eventually consistent: in the absence of new writes, all replicas eventually converge on the same value

You can see that they had to stretch a bit to get the definitions to fit the acronym. BASE systems favour availability over strong consistency and often talk about making “best efforts” rather than guarantees. The NoSQL trend has in recent years seen a spate of new database systems offering BASE semantics. Whether eventual consistency is suitable depends on your use case. If in doubt, favour stronger consistency.

Consistency protocols

In the previous section, I defined consistency models in terms of permitted orderings of operations. That tells us how a consistency model behaves but not how it might be implemented. This is where consistency protocols come in. Consistency protocols implement consistency models. They are a contract between the node processes and the data store. If the processes implement the protocol correctly, the protocol guarantees that the data store will behave with the expected consistency. The weaker the consistency model, the easier the protocol is to implement in terms of complexity and resource requirements (primarily time). The protocols below offer strong consistency and can be tweaked to provide greater availability and speed at the cost of weaker consistency.

Primary-secondary replication

Also known as leader-based replication or master-slave replication. One node is designated the “primary” and the remainder are “secondary” nodes. The primary accepts read and write operations and the secondaries only accept reads. All changes are performed on the primary and then replicated to the secondaries. This makes it easy to keep the system consistent since all changes go through a single node. Should the primary fail, the system can automatically fail over by promoting a live secondary to be the new primary.

This architecture is a popular way to provide high availability for a database, provided that a brief period of downtime during failover is acceptable. A common work pattern for a database is to have many read requests and few write requests (imagine how many people read Wikipedia and how many edit it). As long as the number of write requests remains low enough to be handled by a single node, the system can easily scale to handle increasing read load by adding new secondaries.

The precise form of consistency on offer depends on how updates are replicated to the secondaries. If the replication is synchronous, the primary doesn’t respond until every secondary confirms receipt. This provides strong consistency because every secondary always maintains a full copy of the latest data. A failover would not cause any data loss because the new primary would have the same data as the old one. As you can see in the diagram below, waiting for confirmation from every node increases overall latency and a single slow secondary is enough to hold up progress for the whole system. Availability is also limited since the primary cannot accept a change unless every secondary acknowledges it.

Asynchronous replication provides eventual consistency. The primary responds to the client without waiting for every secondary to confirm receipt. As the diagram shows, this is faster because we don’t need to wait for every secondary to acknowledge. The primary can continue accepting requests even if secondaries are partitioned. On the flip side, it’s possible that an out-of-date secondary will return stale reads if a read request reaches it before the latest changes are replicated. Data loss could even occur during failover if a secondary gets promoted to primary before it sees an update accepted by the old primary. Since the state of the primary determines the state of the system, any changes not replicated to the new primary would be lost.
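The trade-off between the two replication modes can be seen in a short sketch (all names are illustrative, and real systems must of course handle timeouts and retries):

```python
class Secondary:
    """A replica that stores whatever the primary sends it."""
    def __init__(self):
        self.log = []

    def replicate(self, entry):
        self.log.append(entry)
        return True  # acknowledgement back to the primary

class Primary:
    def __init__(self, secondaries):
        self.log = []
        self.secondaries = secondaries
        self.unreplicated = []  # backlog for asynchronous mode

    def write_sync(self, entry):
        # Synchronous: the write succeeds only once every secondary acks.
        self.log.append(entry)
        return all(s.replicate(entry) for s in self.secondaries)

    def write_async(self, entry):
        # Asynchronous: acknowledge immediately, replicate in the background.
        self.log.append(entry)
        self.unreplicated.append(entry)
        return True

    def flush(self):
        # Until this runs, secondaries may serve stale reads.
        for entry in self.unreplicated:
            for s in self.secondaries:
                s.replicate(entry)
        self.unreplicated.clear()
```

In a real system, a single partitioned secondary would block `write_sync` indefinitely, while entries sitting in `unreplicated` are exactly the updates that would be lost if a stale secondary were promoted during failover.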

Synchronous and asynchronous replication

Two-phase commit

Two-phase commit (2PC) is a simple protocol that offers strong consistency. It works as a distributed transaction. A primary node instructs the secondary nodes to perform an operation and to report whether they think the transaction should commit or abort. Each secondary performs the work, saving the results to a temporary log, and reports the outcome. If every node reports success then the primary instructs the secondaries to proceed with the commit. If any node reports an error, the primary instructs the secondaries to abort the transaction. A real world example is a marriage ceremony:

Officiant: Adam, do you take Steve to be your husband?
Adam: Yes!
Officiant: Steve, do you take Adam to be your husband?
Steve: Yes!
Officiant: You are now married!

The officiant is the primary node and Adam and Steve are both secondaries. The act of marriage is the transaction. Crucially, it is not sufficient for Adam and Steve to both say “Yes!”. The marriage is only confirmed (i.e. the transaction committed) when they both accept and the officiant confirms it.

A successful two-phase commit

Two-phase commit is superficially similar to synchronous primary-secondary replication. The difference is that in primary-secondary replication, the secondaries have no choice but to accept whatever the primary pushes out to them. In two-phase commit, the primary is seeking consensus: it is trying to get a system of potentially faulty nodes to agree to commit a value. Each secondary has the choice to reject the proposed value by responding with an “abort” message. Two-phase commit can be used to agree on any value. Synchronous primary-secondary replication is a special case of two-phase commit in which the values the primary proposes are always new changes and the secondaries always accept them.

Due to these similarities, two-phase commit suffers from the same limitations as synchronous replication. It is very susceptible to network partitions and delays. Each change requires at least four network transmissions per node. The system can only proceed at the speed of the slowest node and all nodes must be available. Because of this, two-phase commit is relatively rare in production systems. There is a three-phase variant that avoids being blocked by a slow node but it is still partition intolerant. Can we do better? Yes!
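Stripped to its essentials, the protocol might be sketched as follows. The names are mine, and a real implementation must also handle coordinator failure, participant timeouts and recovery from the temporary log:

```python
class Participant:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.staged = None  # temporary log
        self.value = None   # committed state

    def prepare(self, value):
        if not self.healthy:
            return False     # vote to abort
        self.staged = value  # stage the change in the temporary log
        return True          # vote to commit

    def commit(self):
        self.value, self.staged = self.staged, None

    def abort(self):
        self.staged = None

def two_phase_commit(value, participants):
    """Minimal 2PC sketch: every participant must vote to commit."""
    # Phase 1: prepare — each participant stages the change and votes.
    votes = [p.prepare(value) for p in participants]
    # Phase 2: commit only on unanimity, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit("married", [Participant(), Participant()]))  # committed
print(two_phase_commit("married", [Participant(), Participant(healthy=False)]))  # aborted
```

Note how a single “abort” vote rolls back the whole transaction, which is the source of the protocol’s sensitivity to slow or failed nodes.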

Fault-tolerant consensus protocols

The consensus offered by two-phase commit sounds useful. It’s a shame that the protocol is so easily blocked by slow nodes and partitions. What if we loosen the requirement that every secondary has to agree? Fault-tolerant consensus protocols achieve consensus while allowing some nodes to fail. At a high level, fault-tolerant consensus protocols work like two-phase commit but only require a response from a majority of secondaries. They achieve this at the cost of much greater implementation complexity. The protocols are sophisticated and hard to implement correctly in real world systems with Byzantine fault models.

The first fault-tolerant consensus protocol was developed in 1989 by Leslie Lamport (mentioned in the introduction) and is known as Paxos. Lamport thought that his protocol would be more memorable if he gave it an exciting back story and so he presented it as if he had unearthed the voting rules of the ancient parliament of the Greek island Paxos. He even went so far as to dress up as Indiana Jones when presenting it at conferences. Unfortunately, computer scientists are not known for their sense of humour and the performance did not have the desired effect. It took a few more years for the significance of the protocol to be fully appreciated. Partly this is due to its complexity. Even with the faux archaeology stripped away, Paxos is notoriously difficult to understand and implement correctly. A later protocol called Raft offers the same guarantees and is designed to be easier to understand and implement. Both Paxos and Raft can tolerate the failure of up to N nodes in a cluster of 2N + 1, so long as the remaining majority stays available. It will seem to clients that they are interacting with a single, reliable system even if a minority of the nodes in the system fail.

Let’s get a sense of how fault-tolerant consensus works by looking at Raft. There are far too many details to give a comprehensive explanation here, so I’ll focus on the core elements. Nodes in the system reach consensus through successive rounds of voting divided into periods called “terms”. Each term begins with an election to be the leader for that term. Every node starts off as a follower. After a random timeout, it will promote itself to candidate and request that the other nodes elect it as leader. A follower will vote for a candidate if it hasn’t already voted for another in that term. The candidate that receives a majority of votes is elected leader. Interestingly, leader election is thus itself an example of achieving consensus. The leader acts as the primary node, coordinating work and handling communication with clients.
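A highly simplified sketch of a single election round, with illustrative names and none of Raft’s log-comparison rules, might look like this:

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.voted_in_term = 0  # highest term this node has voted in

def elect(nodes, term):
    """One election round: a random timeout picks the candidate, majority wins."""
    # The node whose randomized timeout fires first becomes the candidate.
    candidate = min(nodes, key=lambda n: random.uniform(150, 300))
    votes = 0
    for node in nodes:
        # Each node grants at most one vote per term (the candidate
        # votes for itself).
        if node.voted_in_term < term:
            node.voted_in_term = term
            votes += 1
    # A strict majority is required to become leader.
    return candidate if votes > len(nodes) // 2 else None

nodes = [Node("a"), Node("b"), Node("c")]
leader = elect(nodes, term=1)
print(leader is not None)  # True — all three votes go to the sole candidate
```

The one-vote-per-term rule is what prevents two candidates from both claiming a majority in the same term; real Raft also refuses to vote for a candidate whose log is behind.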

When the leader receives a request from a client, it writes the change to its local log and broadcasts the change to the followers, who each write the changes to their own local log and respond with a confirmatory vote. The leader waits for a quorum of votes before committing and instructing the followers to commit the changes. A quorum represents the minimum number of votes needed for a decision to be considered valid. In the U.S. Senate, for example, a majority of senators must be present to have a quorum. Let’s say the total number of votes in the system is N, the quorum for read operations is R and the quorum for write operations is W. By imposing a few simple restrictions on the quorum values, the system can ensure consensus:

R + W > N: every read quorum overlaps every write quorum, so a read always includes at least one node with the latest committed write
W > N/2: any two write quorums overlap, so two conflicting writes cannot both be committed

Setting R and W to smaller values would allow reads and writes to be performed more quickly, since fewer followers have to vote on the operation, at the cost of permitting inconsistent behaviour.
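The usual conditions — R + W > N and W > N/2 — can be checked mechanically; the function name here is mine:

```python
def quorum_is_consistent(n, r, w):
    """True if the quorum sizes guarantee consistent reads and writes."""
    overlap_reads = r + w > n    # every read quorum meets every write quorum
    overlap_writes = 2 * w > n   # no two write quorums can commit concurrently
    return overlap_reads and overlap_writes

print(quorum_is_consistent(5, 3, 3))  # True — majority quorums
print(quorum_is_consistent(5, 1, 1))  # False — fast but inconsistent
```

The second configuration is exactly the “smaller values” trade-off: single-node reads and writes are quick, but nothing forces them to observe each other.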

The beauty of Raft is that a set of comparatively simple rules is enough to handle many failure cases. Imagine that a partition cuts off the leader from a majority of the followers. The leader will not receive a quorum of votes and so will not be able to commit any changes. This prevents inconsistency but the system cannot make progress without its leader. When the leader cannot communicate with the other nodes, we don’t know whether it’s because of a network issue or the leader has become incapacitated in some way. Followers that don’t hear from the leader will wait a random amount of time before starting their own leadership candidacy. Due to the leadership voting requirements, any new leader must be able to communicate with at least a majority of nodes. If the partition is later resolved and the previous leader reappears, the leader with the higher term takes precedence. Thus the protocol provides fault-tolerant functionality.

I’ve skipped over many of the details but I hope the general mechanism is clear. Nodes elect a leader amongst themselves and perform the same operations in the same order specified by the leader. Since the system as a whole agrees on the order of operations, we have linearizability even though there is no global clock. As you might imagine, all of the voting and elections add latency to every operation and so such protocols will never be super performant. What fault-tolerant consensus protocols offer is linearizable consistency and partition tolerance so long as a majority of nodes survive. Paxos and Raft can’t quite claim to be available according to the CAP theorem’s definition because requests to nodes in a minority partition must fail, but in practice this is not an issue because the nodes in the majority partition continue to operate.

I highly recommend that you look at the Raft visualisation resources in the further reading. The protocols are difficult to understand by reading detailed instructions. You can get a much more intuitive sense of their operation by watching them work. They are truly remarkable. You will find that many distributed systems use some kind of Paxos- or Raft-based implementation for at least some of their internal state. Google Spanner uses Paxos to manage replication across nodes. Kubernetes’ etcd store is based on Raft.

Conclusion

Distributed systems offer many alluring benefits but come with a hefty complexity cost. Tasks that were once straightforward become much more complicated when multiple nodes are involved. Things can and frequently do go wrong in confusing, non-obvious and very difficult to debug ways. Yet distributed systems are simply unavoidable if you want to tackle big problems because they free us from the limitations of a single machine.

Distributed systems are characterised by independent nodes communicating over an unreliable network to achieve a common goal. An unreliable network is challenging because a sender can never be sure whether an unacknowledged message was lost, delayed or delivered. A particularly interesting (and frustrating!) feature of distributed systems is that the system has no global clock. Each node has its own, local perception of time with no ordering between nodes. This complicates achieving consistency because nodes may have different perceptions of the order in which operations occur, leading to different outcomes.

The CAP theorem declares that when a network failure occurs, the system must either sacrifice availability, by not accepting some requests, or consistency, by allowing conflicting writes (known as split brain) or reading old values (stale reads). The important point is that no distributed system can offer the same consistency and performance guarantees as a single-node system. Rather than seeing consistency and availability as binary states, we should see them as ranges of behaviour. If we loosen our consistency requirement to something weaker than the CAP theorem’s linearizability, we can have both consistency and availability to some degree. Consistency requires communication to keep data stores in sync and so the stronger the consistency offered by a system, the less it can be available and vice versa. Weaker forms of consistency include sequential consistency, which is like an asynchronous message queue, and eventual consistency, which only guarantees that the system eventually converges on the correct value. All of them are appropriate in certain situations. There is no one size that fits all.

A simple way to ensure strong consistency is to use synchronous primary-secondary replication. One node is designated leader, handles all changes and replicates them to secondary nodes. This works well in the specific case of a database replicating changes across the system. A more general protocol is known as two-phase commit, in which all nodes must agree to commit changes as part of a distributed transaction. The problem with both mechanisms is that they are very sensitive to network errors or slow nodes. Asynchronous replication is one solution but it introduces the risk of inconsistency. Fault-tolerant consensus protocols, such as Paxos and Raft, resemble a two-phase commit protocol in which the nodes elect their own leader and node failures are tolerated as long as a majority of nodes remain.

I hope that this chapter encourages you to think about how networked computers behave as part of a bigger system. We won’t all get the opportunity to work with some fancy Kubernetes cluster or Spanner database, but every web developer is working on a distributed system in some way. You are now better equipped to reason about how the system operates, how it handles its state and how it can go wrong.

Further reading

Good written resources for distributed systems are a little thin on the ground. It’s an area that really benefits from hard won, practical experience. You’ll notice that distributed system experts all have the kind of thousand yard stare that comes from debugging intermittently failing systems. There are a few good blog posts that attempt to distil years of experience and set out best practice. I like Notes on distributed systems for young bloods and a good next step is Distributed systems for fun and profit. It’s relatively short but goes into more depth than I’ve had space for here. Special mention must go to Kyle Kingsbury’s Jepsen project. He regularly provides detailed reports on the real world performance of distributed databases and also provides plenty of explanatory and background material. It’s very illuminating to see just how hard it is to get things right in this space.

I highly recommend that you check out Raft’s website for in-browser visualisations of how the protocol works. It’ll make so much more sense, trust me.

Until fairly recently, I struggled to find a really good textbook on distributed systems. Tanenbaum and van Steen’s Distributed Systems is available for free but I honestly found it very dry. Happily, Martin Kleppmann has come along with Designing Data-Intensive Applications. This book is as close as I’ve seen to a publishing phenomenon in the world of computer science textbooks. Recommendations for it keep popping up everywhere and with good reason: it’s eminently readable, focused on practicalities and very wide-ranging. I’ve already recommended it in the databases chapter but really the heart of the book is on distributed systems. After getting confused by various technical definitions of linearizability and other consistency models, I was especially grateful for its focus on informal, intuitive explanations. If you’re not quite ready for a full textbook treatment, another good starting point is Kleppmann’s blog post critiquing the CAP theorem. It has links to other useful resources.

An obvious recommendation is to go build a distributed system. That, of course, is not always straightforward. Something I’ve been mindful of in this chapter is that many people won’t have the opportunity to work with exciting, highly available, globally distributed systems. I have two suggestions. Firstly, the Grokking the system design interview course, which comes highly rated, teaches you to think through how such systems might have been designed. You’ll develop a much better understanding of the considerations and trade-offs involved in architecting reliable systems. Secondly, I recommend familiarising yourself with Kubernetes. A managed cluster can be cheaply and easily deployed on any of the major cloud services. Kubernetes is interesting because it’s eventually consistent and so exposes a lot of intermediate state to the user. It’s based on the simple concept of control loops (think how a thermostat works) and it’s neat to see how the composition of simple loops builds complex behaviour. Deploying applications is a relatively slow process so you can literally watch as the cluster state gradually converges on the desired state.