Programming is undergoing a major transition right before our eyes. In the olden days, programs were executed sequentially by a relatively simple processor. That straightforward model has been upended by the computing industry’s pursuit of higher performance. Innovations in chip design require new programming paradigms.
For decades, the processor industry was in thrall to Moore’s law: an observation by Gordon Moore, co-founder of Intel, that the number of transistors on a processor doubled roughly every two years. Another observation, Dennard scaling, stated that a chip’s power consumption would remain constant as the transistors became smaller and denser. Taken together, the two observations correctly predicted that performance per watt would increase rapidly. Between 1998 and 2002, processor speeds increased from 300MHz in a Pentium II to 3GHz in a Pentium 4. That’s ten times faster in four years. For programmers this was a good thing. Processor speeds increased so rapidly that upgrading hardware was the easiest way to improve performance.
Such rapid performance increases could not go on forever. Components had to become smaller and smaller so that more could be packed into the same space. Increasingly tiny components made it more and more difficult to dissipate heat and keep the electrons flowing correctly. Clock speeds increased in smaller and smaller increments until finally they increased no more. Manufacturers desperately sought ways to deliver the performance improvements that their customers had come to expect. With no easy way to further increase clock speeds, they came up with the wheeze of bundling multiple processors, or cores, on to the same chip. And thus the multi-core revolution was born.
In theory, a processor with two cores can do twice as much work in the same time as a processor with a single core. In reality, things are not so simple. Doubling the number of cores does not directly improve performance in the same way that doubling the clock speed does. A single-threaded program on a multi-core processor will utilise a single core and leave the others idle. The CPU cannot automatically spread the computation over multiple cores. We must shift to a new programming paradigm in which tasks execute concurrently on multiple cores.
In this chapter we’ll use a few closely related but distinct concepts. Let’s go through each of them in turn.
Concurrency means dealing with multiple things at once. A program is concurrent if it is working on multiple tasks at the same time. That doesn’t mean that it has to be literally performing multiple tasks at the same time. Cast your mind back to the multi-tasking scheduler we saw in the OS chapter. On a single-threaded processor, there is only ever one process actually executing at any given time, but the scheduler manages to deal with multiple tasks by giving each slices of processor time. The concurrent behaviour observed by the user is created by the scheduler in software.
Parallelism means actually doing multiple things at once. Parallelism requires hardware support. A single-threaded processor physically cannot execute multiple instructions simultaneously. Either hardware threads or multiple cores are necessary. A problem is parallelisable if it can be solved by breaking it into sub-problems and solving them in parallel. Counting votes in an election is a parallelisable task: count the votes in each constituency in parallel and then sum the totals to find the national result. Parallelisable tasks are desirable in computing because they can easily make use of all available cores.
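To make the idea concrete, here is a sketch of the vote-counting example in Go, the language we’ll use later in the chapter. The vote data and function names are invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// countVotes sums each constituency's votes in parallel and then
// combines the per-constituency totals into a national result.
func countVotes(constituencies [][]int) int {
	totals := make([]int, len(constituencies))
	var wg sync.WaitGroup
	for i, votes := range constituencies {
		wg.Add(1)
		go func(i int, votes []int) {
			defer wg.Done()
			for _, v := range votes {
				totals[i] += v // each goroutine writes only its own slot
			}
		}(i, votes)
	}
	wg.Wait() // wait for every constituency count to finish
	national := 0
	for _, t := range totals {
		national += t
	}
	return national
}

func main() {
	fmt.Println(countVotes([][]int{{1, 2, 3}, {4, 5}, {6}})) // 21
}
```

Each sub-problem is independent, so the counts can run on as many cores as are available; only the final summation is sequential.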
To take an example, you are reading this book. We can think of each chapter as a separate task. You might choose to read the book from cover to cover (good for you!), one chapter after another. That is analogous to a processor executing tasks sequentially. Alternatively, you might start one chapter and, midway through, follow a reference to a section in another chapter. You are now reading two chapters concurrently. You are literally reading only one chapter at any given moment, of course, but you are in the process of reading two chapters. To read two chapters in parallel, however, you would need to have two heads and two books!
Make sure that you grasp the distinction between concurrency and parallelism. Concurrency is the slightly vague idea of having multiple, unfinished tasks in progress at the same time. It comes from the behaviour of software. Parallelism means doing multiple things literally at the same time and requires hardware support.
We saw in the scheduling section of the OS chapter that threads can be either CPU-bound, meaning they need to do work on the processor, or I/O-bound, meaning they are waiting for an I/O operation, e.g. reading a file from disk or sending a file over a network. When a thread makes an I/O request via a system call, the scheduler marks the thread as blocked pending request completion and schedules something else. The problem is that one blocked thread is enough to block an entire process, even though other threads within the process might still have available work to do.
The aim of concurrent programming is to multiplex multiple tasks on to a single thread. From the scheduler’s perspective, the thread will seem completely CPU-bound and so won’t be suspended. There are a few things that need to be in place for this to work. The operating system will need to provide non-blocking versions of system calls. The program will need to somehow manage scheduling its own tasks. This is difficult to get right and so it is normal to rely on a library or a language-specific runtime that takes care of all the scheduling. When I refer to “scheduler” from now on, I mean this scheduler internal to the program, rather than the OS scheduler, unless I specify otherwise.
Concurrency seems pretty cool. We can write code that handles multiple things at once and we don’t have to worry about anything blocking the whole program. What’s not to like?
The problem is non-determinacy. In a synchronous program, we know the order in which tasks execute and so we can predict how they will affect the system. For example, function A writes a value to memory and then function B reads that value. The code generates the same output every time it is called. In a concurrent program, the executing tasks are arbitrarily interleaved by the scheduler. It is entirely possible that function B will run first and read the value before function A writes it, thus generating a different output. Worse, that will only happen sometimes: the scheduler can choose a different ordering every time. The upshot is that the program’s output may vary depending on how its tasks have been scheduled. This is non-determinism. It is very difficult to reason about code and have assurances about its correctness in the presence of non-determinism. How can you reliably test code that breaks only rarely? Perhaps when you write and test the code, the scheduler consistently chooses an order that functions correctly. Then, weeks from now, for whatever reason, the scheduler chooses an ordering that breaks your code and everything blows up in production.
The root cause of the problem with functions A and B above is not so much the random interleaving but the fact that the two functions have shared, mutable state. I say “shared” because they both have read access to the same memory addresses and “mutable” because they both can modify the data in those addresses. If there were no shared, mutable state between the two functions, it wouldn’t really matter in which order they were executed.
A classic example of the problems in shared, mutable state is a broken counter implemented using two threads (in Java-like pseudo-code):
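The same broken counter can be sketched in Go, the language we’ll use later in the chapter; the loop bounds follow the description below:

```go
package main

import (
	"fmt"
	"sync"
)

// brokenCount increments a shared counter from two goroutines
// with no synchronisation at all.
func brokenCount() int {
	count := 0
	var wg sync.WaitGroup
	for t := 0; t < 2; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				count++ // read, increment, write: not atomic
			}
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(brokenCount()) // usually less than 20000 on multi-core hardware
}
```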
In this code, we start two threads, each incrementing
count 10,000 times. You might think that this is a faster way to count to 20,000 because the work is split between two threads. In fact, on a system with multiple hardware threads or cores, the program doesn’t even work properly and
count will be lower than expected. Its exact value will vary each time the program runs, due to non-determinacy.
Each thread has to read
count, increment its value and write the modification back to the variable. The problem is that both threads are trying to do that at the same time. Here’s an example of how things will go wrong. Thread one starts and reads the initial value. Depending on the scheduler’s mood, there is a chance that it will suspend thread one and schedule thread two. The second thread will run for a while, incrementing
count by some amount. The scheduler then swaps the threads and thread one picks up from where it left off. Crucially, thread one will still hold the same value it had for
count when it was first suspended. It will be completely unaware of the changes made by thread two and so its modifications will overwrite thread two’s work. This happens repeatedly, leaving
count’s final value far off the expected 20,000.
When multiple threads compete to update shared, mutable state, we have what is known as a data race. More generally, competition for access to any resource is known as a race condition. The observed outcome depends on which thread wins the “race”, as determined by the order in which the scheduler chooses to run them. This is the source of the observed non-determinism and a relatively funny joke (by computing joke standards):
- What do we want? Now!
- When do we want it? No race conditions!
Race conditions occur when two or more threads simultaneously execute a critical section of code. Only one thread may be in the critical section at any time. The critical section in the example above is
count++, which is just a single line of code but actually involves reading the current value, incrementing it and then writing it back out. The way to achieve safe, concurrent code without race conditions and non-determinism is to control, or synchronise, access to critical sections to prevent multiple threads entering them at the same time. All of the concurrency models in this chapter solve this problem in different ways.
The simplest concurrency approach a language can take is to only support the basic concurrency primitives provided by the underlying operating system. If the programmer wants to do anything fancy, they’ll have to build it themselves (or find a library implementation). In Ruby, for example, a network request will block by default:
If you want concurrency within a Ruby program, you need to create threads. Remember that threads operate within the same process’s address space. If a program creates three threads, it will have a total of four (including the original process thread) independently executing threads, all jumping around the address space and changing things as they please. Since they share the same address space, a change made by one thread will be immediately visible to the others: shared, mutable state.
Operating systems provide a very simple synchronisation mechanism known as the lock. If you want to make something inaccessible, you lock it away. A computing lock is no different. A lock is a record that a particular thread has exclusive access to a particular resource. A thread claims a resource by recording that it is using the resource. It does its work and, when finished, releases the lock by erasing the record. A second thread coming along and trying to access the resource must wait for the lock to be released before it can claim the lock for itself. By requiring that a thread claim a particular lock before entering a critical section and releasing the lock when it leaves, we ensure that only one thread is ever in the critical section. Such locks are also known as mutexes (from “mutual exclusion”).
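Here is the earlier broken counter made safe with a lock, sketched in Go using its mutex type:

```go
package main

import (
	"fmt"
	"sync"
)

// safeCount increments a shared counter from two goroutines,
// claiming a lock around the critical section.
func safeCount() int {
	count := 0
	var mu sync.Mutex
	var wg sync.WaitGroup
	for t := 0; t < 2; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 10000; i++ {
				mu.Lock()   // claim the lock before entering the critical section
				count++     // only one goroutine can be here at a time
				mu.Unlock() // release so the other goroutine can proceed
			}
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(safeCount()) // always 20000
}
```

A goroutine that cannot claim the lock simply waits until the holder releases it, so every increment is applied in full.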
Note that thread two is blocked when it cannot claim the lock. Mutual exclusion means that accesses to the resource are now performed synchronously. The synchronisation solves our race condition problem but removes the concurrency, which was the whole purpose in the first place. Achieving the optimal balance of correctness and concurrency via locks requires choosing the correct granularity. Ideally, we want to keep critical sections as small as possible while still preventing non-determinism.
More sophisticated locks allow for more precise access control. Threads needing only to read a value can each claim a read-only, shared lock. Should another thread need to modify the value, it can claim an exclusive lock, but only when there are no shared locks in place. Databases frequently use such mechanisms to coordinate concurrent reads and writes. The risk here is writer starvation: if reads are so frequent that at least one shared lock is always held, a modifying thread will never be able to claim its exclusive lock.
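Go’s sync.RWMutex implements exactly this shared/exclusive scheme. A minimal sketch (the store type is my own):

```go
package main

import (
	"fmt"
	"sync"
)

type store struct {
	mu    sync.RWMutex
	value int
}

// read claims a shared lock: any number of readers may hold one at once.
func (s *store) read() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.value
}

// write claims an exclusive lock: it waits until no readers remain.
func (s *store) write(v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.value = v
}

func main() {
	s := &store{}
	s.write(10)
	fmt.Println(s.read()) // 10
}
```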
Still worse, using locks on multiple resources can lead to a situation known as deadlock. This occurs when two threads have each claimed one lock and are each waiting for the other thread to release the lock it holds. Both threads are stuck waiting forever.
To avoid deadlocks, locks must always be claimed in the same order. That sounds straightforward but is no simple task in a complex application. Many very competent developers have horrendous stories of days spent tracking down very subtle bugs caused by incorrect lock usage. I think part of the problem is that you have to read code in a very different way. Look at the example below (again in Java-like pseudo-code):
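Rendered as Go (the account type and its field names are my own), the transfer function looks like this; the two conflicting calls described below are shown as a comment so that the sketch itself doesn’t hang:

```go
package main

import (
	"fmt"
	"sync"
)

type account struct {
	mu      sync.Mutex
	balance int
}

// transfer claims the sender's lock first, then the receiver's.
func transfer(from, to *account, amount int) {
	from.mu.Lock()
	defer from.mu.Unlock()
	to.mu.Lock()
	defer to.mu.Unlock()
	from.balance -= amount
	to.balance += amount
}

func main() {
	a := &account{balance: 100}
	b := &account{balance: 100}
	transfer(a, b, 10)
	fmt.Println(a.balance, b.balance) // 90 110

	// The danger: if Thread1 runs transfer(a, b, 10) while Thread2
	// runs transfer(b, a, 20), Thread1 may lock a just as Thread2
	// locks b. Each then waits forever for the lock the other holds.
}
```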
At first glance, this seems fine because we consistently claim first the sender lock and then the receiver lock before performing the transfer. When reading the code, however, try to imagine the various ways the two threads might be interleaved. One possible ordering is where
Thread1 claims a lock on
a and then
Thread2 is scheduled and claims a lock on
b, resulting in deadlock because neither thread can claim the second lock. In this example, it’s fairly easy to spot how the lock orderings are reversed. Now imagine that the two calls to
transfer are in completely different areas of the codebase and only occasionally run at the same time. The potential for deadlock would not be at all obvious.
When reading code, it’s natural to make a sequential, synchronous mental thread of execution. When dealing with multiple threads, you have to remember that between any two operations the thread may have been paused and another scheduled thread might have changed any of the shared, mutable state. A value that was true in the conditional of an if statement might have become false by the time the body executes.
console.log("1") operation, which it does synchronously (without enqueuing a task) by outputting
console.log request. It then returns from
console.log. The output is therefore:
The picture is further complicated by micro-tasks (also known as jobs). These slippery little creatures are like high-priority tasks. They go in a separate queue that is drained after the current task completes and before the next task begins. Promises and their handlers, which we’ll look at in more detail below, are good examples of micro-tasks. In this example, what do you think the output will be?
We know that specifying no timeout value for
window.setTimeout enqueues the callback. The promise callback is also enqueued, but because promise handlers are micro-tasks, it wriggles in ahead of the timeout callback:
Treating micro-tasks in this way improves the responsiveness of the system. The whole point of promise callbacks is that they are executed as soon as possible after the promise resolves, so it makes sense to give them priority. Sending them to the back of the task queue could cause an unacceptably long delay.
Callbacks have long been used in browser JavaScript (e.g. addEventListener) but became ubiquitous in node.js. The thing to remember about callbacks is that the callback function executes later, when the response is ready. Performing a sequence of asynchronous operations therefore requires nested callbacks, affectionately known as callback hell:
There are various techniques to make it look a little better, but you can’t get away from the fundamental weirdness that the program’s control flows through the callbacks. The nature of asynchronous code is that a variable’s value depends on when the current scope is executing. An operation’s result only exists in the scope of its callback. That makes sense because, as we’ve just seen, the callback task is only created when the operation completes. We must be careful to structure our code so that tasks needing the result are only scheduled when the result is available. That’s why, in the example above, we have to call
validateRequest’s callback: it’s only in that scope that the necessary result is available.
.then(), which returns a new promise to allow for chaining:
Promises introduce the concept of an asynchronous value. Notice that
handleRequest above returns a promise. That’s more meaningful than returning
undefined and having to pass in a callback to access the result. Yet promises are no panacea. Asynchronous operations have to be sequenced as a chain of
.then() calls, which are subjectively ugly and objectively very different to standard, synchronous code. Worse, if an error occurs in a promise it is rejected, requiring the
.catch() handler at the bottom of the chain. This is different to the normal try/catch error handling of synchronous code. Async functions, marked with the async keyword, smooth over these problems. We call them and assign their return values as normal on the understanding that the return value may not actually exist until later. Under the hood, async functions actually return promises. The
await keyword creates a synchronisation point by which the promise must be resolved. If the promise has not yet resolved, the task will suspend until the value is available. We can write asynchronous code that looks almost like synchronous code. The main difference is that we must remember to
await the return value when we want the resolved value:
It’s much easier to see here the sequence of actions than in either the callback or promise examples. Since async/await are syntactic sugar around promises, we can use them directly on promises too. That might help clarify what they are doing. The following async function awaits a promise that resolves after a timeout in order to emulate a slow, asynchronous operation (e.g. a database query):
Here we call
slowFetchId and assign its return value to
id. We log what it contains before and after we await the result:
Since slowFetchId is async, it immediately returns a pending promise. Note that the returned promise represents the whole function
example and so it actually contains the separate promise defined within
example. When we want the actual value from the promise, we use
await to suspend the current task execution until
id has settled by resolving into a value or rejecting with an error.
slowFetchId is itself waiting on the timeout promise, so it can’t do anything until the runtime triggers the timeout callback, which enqueues a micro-task to resolve the promise and allows
slowFetchId to return
5. The promise held in
id can now resolve and execution moves on to the final logging statement.
The main limitation of async/await is that any function calling an async function also needs to be marked as async. Making just one function async might cause a whole cascade of tedious refactoring. You might see this referred to as having to “colour” a function. The terminology comes from an influential blog post on the topic, which I have included in the further reading. While not perfect, async/await is a big improvement. There was a dark period when callback hell coexisted with various, mutually incompatible promise libraries. Best practices seemed to change from day to day. Thanks to the combination of built-in promises and async/await, things are now settling down. Using promises, we can express the concept of an asynchronous value and with async/await, we can express asynchronous operations in a clean and consistent manner.
You might expect Ryan Dahl, the creator of node.js, to be bullish about it as a server platform. In fact, he says:
I think Node is not the best system to build a massive server web. I would use Go for that. And honestly, that’s the reason why I left Node. It was the realisation that: oh, actually, this is not the best server-side system ever.
Concurrency in Go is implemented via goroutines: lightweight, userspace threads created and managed by the Go runtime. A single Go process might contain thousands or even millions of goroutines. Switching between goroutines is much faster than switching between OS threads and is not much slower than a simple function call. To achieve concurrency, a Go programmer simply has to create a goroutine for each concurrent task. If there is suitable hardware support, the Go runtime will even execute them in parallel with no additional effort required from the programmer.
By default, channels are unbuffered, meaning that they have no capacity to store values: a write blocks the sending goroutine until another goroutine reads the value, and a read blocks until a value is written. (Channels can also be created with a buffer; reading from an empty buffered channel and writing to a full one likewise block.) That gives another goroutine an opportunity to complete the other half of the exchange. Channel operations therefore naturally create synchronisation points in the code. The select block is a bit of syntax that allows a goroutine to wait on multiple channel operations. In Go, we use channels and select blocks to express the logical dependencies between concurrent goroutines and leave the scheduler to work out the details.
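A small sketch of select in action (the channel names and timings are invented): two goroutines send on separate channels and select handles whichever operation is ready first.

```go
package main

import (
	"fmt"
	"time"
)

// race collects messages in the order their channels become ready.
func race() []string {
	fast := make(chan string)
	slow := make(chan string)
	go func() { time.Sleep(10 * time.Millisecond); fast <- "fast" }()
	go func() { time.Sleep(50 * time.Millisecond); slow <- "slow" }()

	var order []string
	for i := 0; i < 2; i++ {
		// select blocks until one of its channel operations can
		// proceed, then runs the corresponding case
		select {
		case msg := <-fast:
			order = append(order, msg)
		case msg := <-slow:
			order = append(order, msg)
		}
	}
	return order
}

func main() {
	fmt.Println(race())
}
```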
None of these concepts is particularly new or innovative. The strength of Go’s concurrency model comes from being a simple, well-engineered combination of all three: goroutines, channels and select. The theoretical underpinning is known as communicating sequential processes (CSP) and was first developed by C.A.R. Hoare back in 1978 (see further reading).
Implementing its own form of userspace threads means that the Go runtime has to implement its own userspace scheduler. The OS scheduler plays no part in scheduling goroutines: it only sees the threads of the Go runtime itself. It’s up to the Go runtime to efficiently schedule concurrent goroutines. The Go runtime scheduler tries to avoid the OS scheduler suspending its threads by scheduling goroutines so that those threads are continually busy. If the runtime’s threads run out of work or become blocked, the OS scheduler might suspend the entire Go runtime process, halting all goroutines, even though some of them might still have been able to do useful work. The OS scheduler simply can’t see them.
Let’s look at the Go scheduler in more detail. The scheduler creates a logical processor (P) for each virtual core (that is, counting each hardware thread in each physical core). Each P is assigned a physical thread called a machine (M). The OS schedules the M threads according to its own rules. Goroutines (G) are created when the program starts and by using the special
go keyword. Each goroutine is scheduled on to an M. Each P has a local run queue (LRQ) of goroutines assigned to it. The logical processor P runs the goroutines from the LRQ on the machine M. There is a separate global run queue (GRQ) for goroutines that aren’t yet allocated a logical processor. It all looks like this:
Goroutines can be in one of three states with fairly self-explanatory names: waiting when the goroutine needs some resource in order to continue, runnable when it’s ready to go and executing when it’s actually running on a machine. The scheduler is cooperative because goroutines voluntarily relinquish the machine at defined points, such as system calls, channel operations and function calls. The scheduler doesn’t pre-emptively suspend goroutines in the way that the OS scheduler suspends processes. (Recent Go releases do add a limited form of pre-emption so that tight loops can’t monopolise a machine, but the model is easiest to understand as cooperative.) The Go scheduler implements a work-stealing algorithm: when a logical processor empties its LRQ, it will look for pending work in the global queue or steal pending goroutines from another logical processor. It’s like an annoyingly high-achieving colleague who finishes all their work and starts doing some of your pending tickets to fill their time.
A disadvantage of goroutines, and userspace threading in general, is that each goroutine requires at least a few kilobytes of memory to store its stack. This is normally inconsequential but something to bear in mind if your application is going to create huge numbers of goroutines. The Go runtime starts each goroutine off with 2KB of stack space. If the goroutine calls lots of functions and is in danger of running out of stack space, the runtime handles this by allocating a bigger block of memory and copying the entire stack over to the new location. It’s a bit like how dynamic arrays automatically reallocate when they reach their capacity. Moving the stack entails rewriting the address of every variable on the stack. Go is only able to do this because it tracks every use of a variable’s address and so can rewrite the addresses as necessary.
We can rely on the blocking nature of channel operations to automatically create synchronisation points in our code. The runtime will detect when a goroutine is blocked and switch to another runnable goroutine. Eventually, assuming no deadlocks, a goroutine will run that unblocks the first and progress continues. We can write our code in a straightforward, declarative style without worrying about sequencing operations in the correct order. The scheduler takes care of that. There’s therefore very little distinction between synchronous and asynchronous Go code. A direct function call occurs synchronously in the caller’s goroutine. Calling a function with the
go keyword creates a new, asynchronous goroutine.
Let’s look at how to use channels to synchronise goroutines. In this example, we offload a slow computation to a separate goroutine:
This doesn’t work as intended:
The program doesn’t wait for
slowCompute before exiting! The problem is that the
main goroutine has not been instructed to wait for
slowCompute. What are we missing? Synchronisation! We fix this by having
slowCompute write to a channel and
main read from it. As we know, a read operation on an empty channel will block until a value is sent:
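A sketch of the fixed version (slowCompute’s body, a short sleep followed by sending 42, is my stand-in for the book’s computation):

```go
package main

import (
	"fmt"
	"time"
)

// slowCompute emulates an expensive computation and sends
// its result over the channel when done.
func slowCompute(result chan<- int) {
	time.Sleep(50 * time.Millisecond)
	result <- 42 // the write blocks until main reads
}

func main() {
	result := make(chan int)
	go slowCompute(result)
	fmt.Println(<-result) // the read blocks until slowCompute writes
}
```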
The <-result syntax means “read from result” and result <- means “write to result”. This works as expected:
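Channels also work well for coordinating a pool of workers. Here is a sketch matching the description that follows; the job body, sleeping briefly and then doubling the job number, is my assumption:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker pulls jobs from the jobs channel until it is closed.
func worker(jobs <-chan int, results chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	for job := range jobs {
		time.Sleep(10 * time.Millisecond) // simulate a slow task
		results <- job * 2
	}
}

func runJobs(numWorkers, numJobs int) []int {
	jobs := make(chan int, 100)
	results := make(chan int, 100)
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go worker(jobs, results, &wg)
	}
	for j := 1; j <= numJobs; j++ {
		jobs <- j
	}
	close(jobs) // no more jobs: lets each worker's range loop finish
	wg.Wait()   // block until every worker has returned
	close(results)
	var out []int
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	for _, r := range runJobs(4, 20) {
		fmt.Println(r)
	}
}
```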
There are lots of interesting things going on here! We create two channels: jobs and results. Both have a buffer size of
100. We start up four workers in separate goroutines. We then write twenty jobs to the
jobs channel. The workers pull jobs from the channel using the
range keyword. The workers sleep briefly during processing to simulate performing a time consuming task. Once we’ve put our jobs into the channel, we close it to indicate that no more values will be written. That stops the workers ranging infinitely over the channel, forever waiting for values that never arrive. A
waitGroup is a counter that’s incremented every time we call
Add(1) and decremented every time we call wg.Done().
wg.Wait() blocks the
main goroutine until the counter is zero. In this case, that means that every worker has finished and all of the work is complete. The output will look something like this, though the exact ordering is non-deterministic and depends on how the scheduler chooses to run things:
Go makes it easier to sequence concurrent operations and to reason about how control flows through a concurrent application. Nevertheless, writing correct concurrent code still requires some care. Here is a trivial example that deadlocks:
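Sketched in Go, matching the description that follows (the name channel and the worker body are assumptions):

```go
package main

import "fmt"

func main() {
	channel := make(chan int) // unbuffered: every write blocks until read

	// worker sends a single value
	go func() {
		channel <- 1
	}()

	// block until the first value is available
	select {
	case v := <-channel:
		fmt.Println(v)
	}

	// No goroutine will ever read this second write, so the
	// main goroutine blocks forever: deadlock.
	channel <- 2
}
```

Running this prints 1 and then the runtime aborts with “fatal error: all goroutines are asleep - deadlock!”.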
We start off a worker in a separate goroutine and then block until a value is available using
select. That will work. Trying to send the second value to
channel is what causes the deadlock. Since
channel is unbuffered, the write will block the main goroutine until something reads it. That gives the scheduler a chance to run other goroutines and hopefully one of them will read from the channel. However, in this instance there is no other goroutine trying to read from the channel and so the program blocks forever – deadlock! Admittedly, the cause of the deadlock in this contrived example is pretty obvious but it’s surprisingly easy to cause subtle deadlocks in more complex code. On the plus side, the Go compiler includes a deadlock detector that can often find deadlocks and even show the faulty line of code.
We live in interesting, concurrent times. In this chapter, we saw that changes in computer architecture are driving changes in how we write software. Programs need to be able to take advantage of multiple processor cores. Concurrent programming involves writing code that can deal with multiple things at once. Parallelism means literally performing multiple operations at once and requires hardware support. Asynchronous programming means writing code that does not block while waiting for operations to complete. All of them require thinking about programming in a different way.
Many of the textbooks cited in the operating systems chapter include thorough treatments of the concurrency primitives provided by operating systems. If you haven’t already read Operating Systems: Three Easy Pieces, now would be a good time to start. Herb Sutter’s article The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software is a little dated now but provides some useful background context on the hardware changes driving developments in concurrent programming.
When those textbooks warn gravely of how threads and locks lead to dreadful concurrency bugs, it’s sometimes difficult to grasp just how subtle the bugs can be. Deadlock Empire is a fun, little game that challenges you to find and exploit concurrency bugs. You must slay an evil dragon that, fortunately for us, uses locking mechanisms to protect itself. The simple fact that finding concurrency bugs in lock-based code can be made into a mind-bending puzzle game illustrates just how hard it is to write correct code with locks.
If you would like to know more about the basis for Go’s concurrency model, I recommend the original paper: Communicating Sequential Processes by C.A.R. Hoare. It’s easy to find online and you can even find the problems from the paper solved in Go in this package.
Seven Concurrency Models in Seven Weeks by Paul Butcher provides a pacey introduction to a range of concurrency models implemented in a few languages (including Java, Erlang and Clojure). I recommend it both as a good overview of the problem space and as a window into some environments you might not yet have come across (particularly Erlang).