Contents

Operating systems

Introduction

In the previous chapter, we learned how a real computer is put together. You may be forgiven for thinking that it still seems very far from how you actually use a computer. That’s because there’s a whole layer sitting between the hardware and you: the operating system (OS).

The OS provides solutions to some tricky problems. How does the processor know where to begin executing code or where to find the next program? How do you change between running programs? How do I stop an evil program locking me out of the entire system? To handle these issues we need hardware and software to work very closely together. The software side of the solution is the OS. It’s the first program that runs on the computer and it acts as an interface between the user and the hardware. It abstracts away many of the details and peculiarities of different hardware components and provides a consistent platform on which other programs can run. It makes the computer easier to use, more secure and more performant.

Not giving user programs direct access to the hardware is a very good thing. Imagine writing a program where you had to handle the cursor yourself. You’d need to check for and identify any attached mice or trackpads, listen for and correctly interpret the hardware signals generated, recognise different types of clicks, gestures, taps and so on. It would be a nightmare! It’s much better for everyone concerned if the OS can take care of all this and simply feed the user program a stream of notifications about where the cursor is and what it is doing.

Modern OSes include Microsoft Windows, Apple’s macOS and GNU/Linux. They are all immensely complex conglomerations of software components that can handle a huge range of tasks. In this chapter, we’ll try to pare things down to the core commonalities. We’ll see how a computer starts up. Then we’ll briefly review some necessary additions to our hardware model. We’ll examine the implications they have on the OS architecture. Then we’ll look at the core abstractions an OS provides to manage the key system components: the processor, memory and storage. In each case, the OS provides a virtual representation of the underlying resource.

Common operating systems

There have been many operating systems over the years. Let’s look at a few of the most popular and how they differ from each other.

Windows is a family of closed source operating systems developed by Microsoft. The very first Windows was not much more than a visual interface on top of the older MS-DOS (Disk Operating System). DOS is interesting as a salutary tale because it was a very simple OS that lacked a lot of what we’d now consider to be standard features. It gave user programs more or less complete control over the hardware and relied on them to behave themselves. This worked as well as you might expect. Later versions of Windows moved away from DOS entirely in a bid to improve its terrible reputation for reliability and security. Windows has traditionally focused on providing an easy to use interface for non-technical users, backwards compatibility and supporting a wide range of hardware.

Unix was an OS developed in the 1970s at Bell Labs that spawned a whole family of derivatives and reimplementations. “Unix” nowadays is more like a description of how to design an OS, formalised in a specification known as POSIX. Strictly speaking Linux refers to an open source reimplementation of the original Unix kernel, dating from the early 1990s. Many of the standard Linux system utilities were provided by a separate project, GNU (a recursive acronym of GNU’s Not Unix), and so you’ll sometimes see the complete OS referred to as GNU/Linux. There are many distributions (or distros) consisting of the Linux kernel packaged with a variety of extensions and user programs to form a complete operating system. Ubuntu, CentOS and Arch are examples of popular Linux distros, each targeting a different use case. Of course, the world has moved on a lot since the 1990s and Linux has continued to develop and add new features that never existed in Unix. Linux is popular with programmers and technically minded users because it offers complete control over the system to those who are prepared to learn how to use it. Being open source means that the user can read and, if they feel up to it, modify the operating system’s code. It’s basically the opposite of Windows. Linux is extremely popular on servers due to its reliability, performance and tweakability.

macOS (previously OS X) is the closed source, proprietary OS that runs on Apple computers. It is built around an open source, Unix-like core called Darwin, which is itself based on a branch of the Unix family known as the Berkeley Software Distribution (BSD). The main distinction between BSD and Linux is that the BSD licence allows it to be used in closed source projects, unlike Linux. On top of this open source foundation, Apple has built a large amount of proprietary, closed source code that makes macOS distinct from Linux. Because they share the same Unix-like fundamentals, Linux and macOS have many similarities. Despite that, programs built to run on Linux will not run on macOS (and vice versa) because the code on top of the common base differs too much. The combination of Unix-like foundation and good user interface makes Apple products popular with developers. Some will tell you that real developers use Linux. Ignore them.

The boot process

To understand why we want an operating system, we’ll start at the very beginning. Let’s imagine that we’ve built a computer like the one described in the architecture chapter. It’s basic but sufficient to run programs and do useful work. How do we actually get it to start running a program? Let’s think about what we’ve learned already. We know that we need to load the program instructions into memory and set the processor’s program counter to the address of the first instruction. The processor will then start fetching and executing our program.

How do we actually get the instructions into memory and how do we tell the processor where to begin? Both of these puzzles can be solved easily if we’re willing to make changes to the hardware. We can write our program to a piece of read-only memory (ROM) and wire it up so that it lives at a fixed memory address. ROM is memory that can only be written once. The benefit is that it doesn’t need electricity to maintain the data. Our program will persist in the ROM even when the machine is powered off. We can then hardcode the ROM start address into the processor’s program counter. Now every time we power on the processor, it will start running our application!

Many simple microcontrollers work like this. Though workable, we have a few problems. For one thing, how do we change the running program? We could load different programs on to different ROMs and just physically switch them out. This is what old games consoles used to do but clearly this won’t work for general computing. A better idea would be to put a special program on the ROM that would let the user select what they wanted to run. It would then load the chosen program into memory and tell the processor where to start executing. What we’ve just invented is a bootloader.

When a computer boots, or starts up, it is hardwired to first execute code stored in a small piece of ROM. Previously, this would have been the BIOS (basic input/output system) but EFI (extensible firmware interface) is more common nowadays. The distinction isn’t terribly important because both perform the same role: work out what hardware the system has and initialise it, find the OS’s code, load it into memory and start executing it.

Every operating system includes a bootloader which is stored on the computer’s disk at the location specified by the BIOS/EFI standard. The bootloader’s sole job is to get the OS up and running. Due to various terribly tedious technical constraints imposed on bootloaders over the years, the modern process actually involves the first bootloader calling a second, more powerful bootloader that then loads and starts the OS. Happily, there’s no need to know all the arcane details as long as you get the general concept.

The boot process begins by the computer powering up. The BIOS/EFI runs and then hands control over to the bootloader, which loads the OS. Finally, the boot process is complete when the OS has finished initialising and is ready to receive user input.

Interrupts: hardware support for software

Under the processing model we developed in the previous chapter, there’s no obvious way to force a program to relinquish control of the processor. Obviously, switching programs is something that it would be convenient to be able to do. There are important security implications too. If I can’t stop a running program, I have no way to stop a program that’s trapped in an infinite loop from freezing the whole system. Similarly, I can’t intercept attempts by an evil program to reformat the hard disk or steal my passwords.

The problem is that our processor gives the running program complete control. The program can use that control to prevent any attempts to transfer control. Frankly, the program can do whatever it likes, up to and including destroying the system. One solution could be to only ever allow the OS to run directly on the processor. The OS would load other programs, analyse their instructions, pass them through to the processor if they looked safe and flag them if they seemed dangerous. This approach could work but would entail a high performance cost. Every user program instruction would need many OS instructions to analyse and verify its safety.

A simple hardware modification, the interrupt, means that user code can run directly on the processor without the OS completely relinquishing control. An interrupt is a hardware signal that causes the processor to stop what it’s currently doing and switch to executing a chunk of code known as an interrupt handler. The processor circuitry is amended to trigger an interrupt on a fixed timer or whenever a program attempts to perform a potentially risky operation. The OS sets up handlers for each interrupt. The handlers will examine the cause of the interrupt and take corrective action if need be.

There are two main types of interrupts. Hardware interrupts are triggered by – guess what! – hardware devices such as network interfaces and keyboards. They essentially have a direct line into the processor through which they can shout “hey! The user has pushed down a key. You need to handle this right now!”. They are also known as asynchronous interrupts because they occur outside the normal program flow and can happen at any point in the execution cycle. This means that the processor has to be able to stop an instruction mid-execution and deal with the interrupt.

Software interrupts are used when the processor itself detects that something is preventing it from carrying on. For example, it might realise that it’s been asked to divide by zero, which is a mathematical impossibility. It can’t proceed so all it can do is report the error and hope something else can handle it. Such interrupts are synchronous because they occur within the normal flow of program execution. They are also called exceptions because they indicate that something exceptional has happened to the program.

The implication of adding interrupts is that we now have two categories of code. Code that runs whenever an interrupt is triggered is privileged because it has oversight and control over unprivileged code. We add a new hardware flag to monitor how privileged the currently executing code is. When the current code is privileged, we set the flag and the processor is said to be running in supervisor mode. The idea is that any code running in this mode has a supervisory role over code that does not run in this mode. If the flag is not set, the processor is in user mode and running unprivileged code. Before executing sensitive instructions, such as modifying files, the processor checks that the flag is set and triggers an interrupt if it is not set.

On receiving an interrupt, the processor stops its normal execution. It stores the state of the registers, flags and so on for safekeeping. The interrupt signal includes a number to identify which kind of interrupt has occurred. The processor uses this number to index into a list of interrupt descriptors maintained by the OS. Each descriptor tells the processor where the corresponding interrupt handler is located. The processor fetches the handler and then executes it. Once the handler has finished dealing with the situation, the processor restores the original program’s saved state and carries on from where it left off.

Thanks to this hardware addition we can deal with unresponsive programs. By triggering an interrupt on a timer, we can regularly suspend a running program and give the OS a chance to run. The unresponsive program won’t be on the processor forever. The OS will get at least some processor time which it can use to check the status of all running programs and terminate any unresponsive ones. Because the timer is triggered by hardware, it can’t be blocked by a malicious or incompetent user program.

The kernel

The OS has power that other programs don’t have. Thanks to interrupts and its privileged mode, it will always regain control of the processor and can stop any other program. Your browser can’t arbitrarily terminate your text editor but your OS can. With this power comes great responsibility. In particular, it’s the responsibility of the OS to not have critical design flaws or bugs that can be exploited to allow unprivileged code to run at the OS’s privilege level. This means that the huge range of services and features offered by the modern OS becomes a bit of a security risk. If the whole OS runs with super special privileges, then a bug in some obscure component, such as a font manager, could offer an attacker a way to gain privileged access to the whole system.

The solution to this problem (apart from reviewing OS code very, very carefully) is to divide the OS into two parts: a small kernel of trusted code that runs with full privileges and a bigger userspace that runs in unprivileged user mode. Supervisor mode is often known as kernel mode for this reason. The only code that goes into the kernel is code that absolutely needs to have higher privileges in order to do its work. The kernel is the core functionality around which everything else is built. Less important programs, such as our poor font manager, can be pushed out into userspace. A bug in a rogue userspace program is obviously still undesirable but doesn’t entail the same security risk.

It’s common to see these modes referred to as rings. In the ring model, kernel mode is ring 0, the innermost. Around that are rings of decreasing privilege levels. Userspace corresponds to ring 3. Rings 1 and 2 are less common and mainly used by specialised programs such as device drivers. Since only ring 0 and ring 3 are regularly used, I personally find it easier to just think of them as kernel mode and user mode.

The basic rule of thumb is that a program should run in userspace unless it absolutely has to run in the kernel.

Gates and system calls

What happens when a userspace program legitimately needs to do something that can only be done in kernel mode? There must be a secure way to move between rings. If you imagine the rings as giant walls arranged concentrically around a vulnerable core, we need ways through the walls. To stop unauthorised access attempts, we also need some kind of stern bouncer deciding who can go through. On x86 processors, the interrupt descriptors are known as gates. The gate’s interrupt handler is the stern bouncer deciding whether or not to permit the access request.

As well as the interrupt handler’s address, each interrupt descriptor specifies in which mode the handler should be executed. Specifying kernel mode provides a secure way to switch between user mode and kernel mode. It’s secure because we can’t just jump into kernel mode as we please – we can only go through the gate. The user program has no control over what the handler does or whether it chooses to permit the requested action. The handler in each gate decides whether or not to permit the operation requested by the user program. It’s just like trying to get into a nightclub. The bouncer decides whether you get in and doesn’t have to explain their decision to you. Trying to argue doesn’t end well.

The OS utilises this hardware support to provide special system calls to allow user programs to request services from the kernel. For example, a user program cannot write data directly to the hard drive wherever and whenever it pleases. If this were permitted, a poorly written program could overwrite critical system files. Instead, the OS provides a set of system calls for interacting with the file system. The user program must use the appropriate system call to request that data written to the desired file. The kernel checks that the request is valid, performs the operation and then reports back on its success or failure to the user program. Only the kernel actually interacts with and modifies the file system and disk. The user program only knows whether its request succeeded and what errors, if any, occurred.

Due to the fact that system calls need to be able to switch into kernel mode, their implementation is very low level (often in assembly) and specific to the processor architecture. The OS therefore provides a C library of wrapper functions as an interface between the system calls and userspace code. The requesting program first calls the C wrapper function. It also runs in userspace. The wrapper function calls the actual system call, which uses a gate to transfer to kernel mode and perform the requested operation.

Each OS handles the details differently. Linux defines one interrupt, 0x80, to be for system calls. Each system call is additionally assigned a unique syscall identifier. Invoking a system call entails raising an 0x80 interrupt and passing the syscall identifier and any arguments to the processor. The processor uses the interrupt identifier 0x80 to find the interrupt handler. The interrupt handler, running in kernel mode, examines the syscall identifier to identify the correct syscall handler. We have two layers, one hardware and one software: the interrupt identifier is used by the processor’s hardware to identify the interrupt as a system call and then the syscall identifier is used by kernel code to identify the particular system call. The processor doesn’t care about the syscall identifier – that’s a software matter for the kernel.

On Linux, you can find lots of information about the available system calls in the manual pages (man syscalls). They will give you an overview of the available system calls, their identifiers and the arguments they accept. I recommend skimming through this to get a sense of the functionality offered by system calls.

The process I’ve described here is technically referred to as the legacy Linux approach for 32-bit architectures. Switching between CPU modes is comparatively expensive because the processor has to store the state of the current program and then restore it. The modern Linux kernel on 64-bit architectures uses a faster method that avoids the mode switch at the cost of greater complexity. I have avoided describing this method because, though conceptually similar, the additional complexity obscures the clean distinction between user mode and kernel mode.

What we have here is a neat separation of concerns. The system calls act as an interface defining the services available to userspace programs. The user program has no knowledge of or control over how each system call is implemented. This greatly reduces the complexity of interacting with the system. Any program written for Linux knows that it can request to read a file by using the correct system call. This will be the same across all machines running Linux. The program doesn’t have to worry about interacting directly with the hard drive or whether it needs to use this or that gate. All of the complexity is hidden away behind the system call and dealt with by the OS. Security is improved because the user program cannot tamper with the system call implementation. Of course, we’re still reliant on the kernel code being written correctly in the first place but at least we only have to verify the correctness of the system calls and their wrappers. We have successfully limited how much damage dangerous user code can do.

Managing the processor

The operating system is a special program that starts up the system and orchestrates every other program, jumping in whenever the processor alerts it to naughty behaviour. How does it do this?

Normally we think of running a program as an activity. It’s something that happens. A process is a clever abstraction that turns this activity into an object that can be moved around and manipulated. A process is an instance of an executing program. It includes the program’s instructions, any memory used by the program, the processor’s flags and registers (including the program counter), any files the program is using and a few other bits and bobs. A process is a complete description of a particular point in an instance of the program’s execution. When we talk about “running a program” we should more precisely say “running a process”. The OS runs programs by loading them into processes, so it is possible to have multiple instances of the same program running at the same time in different processes.

When a program runs, it follows a particular route through the code. A conditional sends it here, a jump sends it over there. Perhaps it’ll take a different path the next time it runs, depending on how its execution context and user input change its behaviour. If we had two separate program counters, we could have two simultaneous but separate threads of execution through the instructions. Every process contains at least one thread and modern processors include hardware support for multiple threads within the same process. By using multiple threads, it’s possible to handle different paths through the code at the same time. This is known as concurrency and we will explore it in much more detail in a later chapter.

In Linux, processes are represented by process descriptors. These are just big, fat data structures that hold all of the above information. The kernel maintains a list of descriptors, known as the task list, that tracks all of the system’s processes. Processes are uniquely identified by their process ID (PID). To learn about currently running processes, we can use the ps utility (-c tidies the output and -f provides some extra information):

1 $ ps -cf
2   UID   PID  PPID   C STIME   TTY           TIME CMD
3   501 51112 51110   0  6:14PM ttys000    0:00.03 iTerm2
4   501 51114 51113   0  6:14PM ttys000    0:00.21 -bash

You can see immediately that I have a Bash shell running in a terminal (iTerm 2). UID refers to the ID of the owning user (i.e. me) and we know PID. But what is PPID? Every process is created by a parent process. PPID refers to the parent’s PID. Processes are like bacteria: they reproduce by splitting themselves in two. When a process wants to create another process, it has to fork itself (via a system call, of course). The OS creates a duplicate of the process that differs only by its PID and PPID. If process 123 forks itself, the new child process will have a new PID – let’s say 124 – and a PPID of 123. The child process is emptied out and a new program loaded into it. We end up with a child process running an entirely different program. This is how the OS handles starting new programs.

If I want to find the process that started my terminal, I need to search for the process with a PID matching the shell’s PPID:

1 $ ps -cf -p51110
2 UID   PID  PPID   C STIME   TTY           TIME CMD
3 501 51110     1   0  6:14PM ??         0:40.86 iTerm2

It turns out that iTerm 2 created process 51110 too. Apparently, iTerm 2 creates a new process for each tab it opens. What initialised the original iTerm2 process? Let’s have a look at this process 1:

1 $ ps -cf -p1
2 UID   PID  PPID   C STIME   TTY           TIME CMD
3   0     1     0   0 Tue05PM ??         1:53.20 launchd

What is this new devilry? We have a process with no parent PID! How is this possible? If every child has a parent then what is the first parent? PID 1 belongs to a special program known as init which is directly initialised by the kernel during the boot process. It’s the first process to run and is responsible for starting every other system process. On my macOS system, the program is called launchd and it has started iTerm 2 because I configured macOS to run iTerm 2 on startup. On Linux, the exact command depends on the distribution though systemd is the most popular (and the most unpopular).

In the Unix model, processes are modelled as pipes. User input goes in one end, known as standard input, and output comes out the other, known as standard output (there is also standard error for error output). To communicate with a process without sending it input, the OS needs a separate communication channel. The OS can send signals to a process. A signal is a number indicating what the OS wants the process to do. Most signals have defined meanings that the process should respect.

For example, when you terminate a process by typing Ctrl-c on the terminal, you are actually sending the process a signal known as SIGINT (interrupt). The idea is that the program will receive this signal and know that it’s time to gracefully shut itself down, perhaps writing any state to disk and closing any resources its opened. There’s nothing that compels a program to respond in this way, however. A pathologically deviant programmer could respond to a SIGINT by doing whatever they want. Only convention and good grace prevent anarchy. Misbehaving programs can be terminated by using the kill command. Standard usage will send a SIGTERM (terminate) signal, which is more forceful than SIGINT but still fairly polite way. If the process is still recalcitrant, you can bring out the big guns in the form of kill -9, which sends a SIGKILL (kill) signal which immediately terminates the program without allowing it to do any clean up. The flag is -9 because SIGKILL is assigned that number. You can read about available signals by calling man signal.

Processes communicate with the OS via return codes. They work just like the return values of ordinary functions. Normally, of course, the user is more interested in the side effects of the program’s execution – output to the screen, files changed and so on – so the return code is mostly used to check whether a program ran without errors. A return code of 0 conventionally indicates success and a non-zero value indicates some kind of error. In a Bash terminal you can query the return code of the last command using $?:

1 $ touch file.txt
2 $ echo $?
3 0 # touch exited successfully
4 $ rm incorrect_name.txt
5 rm: incorrect_name.txt: No such file or directory
6 $ echo $?
7 1 # an error occurred

The exact meaning of each return code depends on the program. The man pages for rm are rather unhelpful:

If an error occurs, rm exits with a value >0.

Scheduling processes

Processes give the OS the ability to stop, pause and restart executing programs. Swapping between multiple processes is known as multitasking. Time available on the processor is divided into slices. Each process runs for a few time slices and is then swapped out so that another process can run. From the perspective of the processor, it’s sequentially executing a succession of different processes. From the point of view of a human user with its slow, meaty brain, the switching between processes is so rapid that all of the processes appear to run at the same time. It’s just an illusion, of course, but it’s very convincing when your computer appears to be playing music, rendering a moving cursor, loading a website and compiling some code all at once. When a computer appears to perform multiple tasks at the same time, we say that it is performing them concurrently. In the concurrency chapter we’ll look at what this means for programming. In this section, we’ll confine ourselves to how the OS implements concurrency.

The act of stopping one process, storing its state and loading another is called a context switch. Performing a context switch is a comparatively slow operation because the OS needs to save absolutely everything belonging to the process and then recreate it identically. When the OS initialises a new process, execution begins afresh from the first instruction. After running for a while, the process’ state will have changed to reflect the effects of the computation. The program counter will have a new value, for example, and maybe the process opened a file or two and updated some values in memory. When the OS performs a context switch, it must record the information it needs to pick up execution from exactly where it left off. Since the process isn’t computing anything when not on the processor, it won’t even be aware that it’s been suspended, just as we aren’t aware of anything when we’re asleep. From its perspective, it has exclusive use of the processor and the only slightly unusual thing is that the clock time suddenly jumps forwards every now and then, indicating the time during which it was suspended.

Programmers like to moan about “context switches” when they are asked to switch the task they’re working on. Building up a mental model of a particular problem often takes some time and switching frequently between tasks means you have to continually rebuild your mental model for each task. Either that or it’s just an elaborate excuse to not have to deal with tech support issues. Who knows!

To recap, a multitasking OS will divide the processor’s execution time among multiple processes, each taking it in turns to run for a short period. This raises a couple of thorny questions: which processes should run, in what order and for how long? The part of the OS responsible for answering these questions is known as the scheduler.

From the OS designer’s perspective, the easiest thing is to just leave it up to the programs to sort it out amongst themselves. The OS can sit back and watch as programs voluntarily cede control to each other in a polite and considerate manner. This approach, known as cooperative multitasking, was used on early versions of Windows and Mac OSes. Sadly, this approach is akin to expecting small children to voluntarily cede control of toys. One badly- or maliciously-written program could behave like a troublesome two year old and block the whole system by refusing to cede control. Just as adults sometimes need to step in and put someone on the naughty chair, cooperative multitasking has been replaced by a more interventionist approach from the OS known as preemptive multitasking. In a preemptive system, a periodic interrupt suspends the current process and invokes the scheduler, which can decide whether to keep the current process going or switch to another.

To implement multitasking, the scheduler needs the concept of process state to describe whether a process is currently running and whether it is runnable. The precise states vary between scheduler implementations but they generally follow this simple model:

A running process is currently executing on the processor.
A waiting process is ready to run but hasn’t yet been loaded on to the processor.
A blocked process is unable to run because it needs some resource to become available before it can proceed.
A terminated process has finished executing or has been sent a signal by the OS telling it to stop.

The state machine looks like this:

This model enables the OS to make much more efficient use of system resources by minimising dead time waiting for resources to become available. At any given moment, a process can have one of two constraints preventing it from successfully completing. It might have instructions that still need to be executed by the processor. Such processes are CPU-bound because the limiting factor is time on the CPU. A process might otherwise be waiting for some kind of input/output (I/O) action to complete such as reading from memory or disk, getting user input or sending something over a network. Such processes are I/O-bound because they can’t continue until the I/O action completes.

If the scheduler puts an I/O-bound process into a blocked state until the I/O request completes, it can use the time to run CPU-bound processes. Once the I/O action is complete, the process is unblocked and ready to continue. The process is now CPU-bound because the only thing stopping it from continuing is lack of CPU time. The scheduler will eventually schedule it to run on the processor. If, while executing, the process becomes I/O-bound again, the scheduler just repeats the process. This approach is preemptive because the scheduler takes control of when and for how long each process runs.

This is works so well because, as we saw in the previous chapter, the processor is much, much faster than everything else. I/O requests are several orders of magnitude slower than CPU operations. It’s clearly undesirable for the processor to sit idle while it waits for an I/O operation to complete. Rather than waste huge numbers of CPU cycles, the scheduler will unload the blocked process and load a CPU-bound one instead. The aim is keep the processor’s utilisation rate as high as possible. The computer can keep doing meaningful work while it waits for the I/O to complete.

A preemptive scheduler might decide to swap out a running CPU-bound process, even though the process would have been able to continue executing, in order to give other processes a slice of processor time. This avoids one process hogging all of the CPU time and starving other processes by not letting them have any CPU time at all. Even among CPU-bound tasks some might be higher priority than others. Decoding a streaming video is time sensitive and so higher priority than a background task. Scheduling lots of processes in a way that’s considered fair is a challenging and active area of research.

Managing memory

We’ve seen how the OS forms a protective layer above the physical hardware. Nowhere is this more evident than in how memory is handled. The OS and processor work closely together to fool user processes into thinking that they have the computer’s entire address space to themselves. Every memory address that a user process sees and uses is actually fake. This deceit goes by the name of virtual memory.

The OS allocates each process a contiguous block of virtual memory addresses known as the virtual address space. This seemingly contiguous block is actually cobbled together by the OS from unused chunks of real memory. The OS and processor map the virtual addresses to the real addresses in a way that is completely invisible to the process. To do this efficiently is difficult and requires an entirely new subsystem in the processor: the memory management unit (MMU). Why go to all this bother? Why not just give the process real addresses?

Virtual memory is a bit like a well-meaning lie. It makes things much easier for the process if it believes that it has access to a nice, big empty block of memory. If the process could look behind the veil and see the true state of affairs, it would be appalled. It would see that the OS had actually allocated it many chunks of memory distributed across the address space. The process would have to keep track of the size and location of each chunk, which ones are filled, which addresses it controls, which it doesn’t and lots of other irritating considerations. It’s far easier to take the blue pill and continue living in the fake world constructed by the operating system. I’m sure we all know the feeling.

Virtual memory is also desirable because it leads to more efficient use of available memory. If a program requests, say, 100MB of memory, the OS doesn’t need to find a single, 100MB chunk. It can combine smaller sections of free space that maybe would otherwise be too small to allocate. This reduces the amount of memory fragmentation, where the blocks of used memory are interspersed with lots of unusably small bits of free space.

When the OS converts a virtual address into a physical address, it also has the opportunity to check that the requesting process has permission to read the requested location. After all, we don’t want a process reading memory that belongs to the kernel or to a different process. This is known as process isolation and is another example of how the OS acts as a gatekeeper, protecting the system’s security.

Finally, virtual memory allows the process to make simplifying assumptions about its address space. Each process will think that it has access to the same address range every time it runs. Programs can therefore make assumptions about where their resources will be located in memory because they will always have the same virtual addresses. Without this it would be much more difficult for a program to locate all of its resources.

The OS deals in chunks of memory known as pages. Usually a page contains a few kilobytes of memory. The process’s address space will be made up of multiple pages. The total amount of virtual memory allocated in pages can even exceed the amount of physical memory in the machine. Not every physical page backing a virtual page needs to be actually held in memory. The OS can swap unused pages out of memory and on to the hard disk. Imagine that a process has been allocated 50 virtual pages but there is only space in physical memory for 30. The remainder will have to be stored on the disk. Whenever the process attempts to read a virtual address and the processor’s MMU can’t find the backing page in memory, it will trigger a page fault. This is a type of exception that tells the OS to swap the required page back into memory, swapping out some other page to make room if need be. The OS repeats the read attempt, which will now succeed since the page is in memory. The process is completely unaware that this is going on. All of the work is handled by the OS. From the process’s point of view, it has access to all 50 pages of memory. Some memory operations will simply appear to take a little longer than others. It is blissfully unaware that the OS is busily swapping pages in and out of memory. In effect, the OS treats main memory as a giant cache for the hard drive.

Having studied the memory hierarchy, we are painfully aware that read operations to the hard disk will be much, much slower than reads to an in-memory page. When a computer has very limited physical memory available it can get itself into a terrible situation known as thrashing. The OS is forced to constantly swap pages back and forth between memory and the hard disk. This takes so much time that the computer doesn’t have time to do anything else. This is why you might have noticed your computer slowing to a crawl before showing an out of memory error. Modern paging systems are very sophisticated and make heavy use of optimisations, including hardware caches in the MMU, to try and avoid this happening.

The process address space

When a process forks itself, the child process initially has a copy of the parent’s (virtual) address space. In order to execute a different program, we need to set up a fresh address space. The OS loads the new program’s code into memory and sets up the address space. The precise structure of the process address space will vary slightly from OS to OS, but the general approach doesn’t change much. A program’s executable is a combination of machine instructions and data. The kernel loads the different elements into various segments that make up the address space:

The program’s instructions (in machine code form) are loaded into the text segment. The instructions are the program “text” that the processor will read and execute. Once everything else has been loaded, the OS will set the program counter to the address of the first instruction and off we go.

All but the most trivial programs will contain variables. Initialised variables are those that have a value hardcoded in the program code. They live in the data segment. Uninitialised variables are just variables without an initial value. They live in the snappily-named block started by symbol (BSS) segment. The reason for keeping them separate from the initialised variables is so that the kernel knows that they have to be initialised to zero when the program starts. If they were stored in the data segment, they’d need to include their zero values in the program code, which would be a waste of space. To easily remember the purpose of the BSS segment, just say to yourself “better save space!”.

The stack segment is the process’s working space. You might remember from the data structures chapter that a stack is a last-in-first-out (LIFO) data structure. Whenever a function is called, a structure called a stack frame is added to the top of the stack. It contains everything that function call needs: the function’s arguments, any local variables and the return address.

An executing function has access to its own stack frame and the frames of enclosing functions. All of the variables in these frames are in the function’s scope. The frames below the current frame in the stack would have been created by the function’s ancestors: the parent function that called the current one, the function that called the parent function and so on back to the very first function. You’ve probably seen a stack trace, which is logged when an unhandled exception occurs. The stack trace shows the complete set of function calls that resulted in the current, exception triggering function call.

When a function returns, its stack frame is removed from the top of the stack. This is why you can’t access a function’s local variables from outside of the function. When we return from a function and go back to the parent scope, the child’s stack frame is removed and its variables become unavailable. Here is an example in JavaScript. Note that JavaScript code executes in a runtime environment that provides the JavaScript program a separate stack but the principle is the same:

 1 function run() {
 2   let A = 'a';
 3   let B, C;
 4 
 5   console.log(A, B, C); // 1
 6 
 7   function inner() {
 8     let B = 'b';
 9     C = 'c';
10     console.log(A, B, C); // 2
11   }
12 
13   inner();
14   console.log(A, B, C); // 3
15 }
16 
17 run();
18 
19 // output:
20 a undefined undefined
21 a b c
22 a undefined c

When run is called, it declares three variables: A, B and C. It only assigns a value to A. At the first comment B and C have undefined values. Next, we call inner, which assigns a value to C and declares its own, local version of B and gives it a value. At the second comment all three variables have values. However, the B with a value is local to inner. It only exists in inner’s stack frame. It shadows, or masks, the undefined B in the outer scope. The assignment to C, on the other hand, is an assignment to the original C in run’s stack frame – it doesn’t create a local variable. When we return from inner, its stack frame is destroyed and we lose all knowledge that the local B ever existed. At the third comment we can see the change to C has persisted but the B in run’s stack frame remains undefined.

A problem with the stack is that the amount of space allocated to local variables in each stack frame is hardcoded into the program’s source code. This means that they have to be known when the program is written. What do we do if the size of the variable will only be known when the program runs? For example, imagine that we write a C function that asks the user to enter their name. The user’s input will be written into an array declared in the function. How can we allocate enough space for the array if we don’t know how long the name will be?

One (pretty poor) solution would be to decree that no name may be longer than 100 characters and to allocate space for an array of 100 characters on the stack. To avoid buffer overflows, we’ll need to check that the input length is less than 100 and discard anything extra. Forgetting to do this would be a major security hole because C will let you write past the end of the array’s allocated space on the stack, overwriting anything following the array. This is a problem because it could mean overwriting important things such as the function’s return address. An attacker could craft an input that runs past the end of the array and replaces the return address with the address of some malicious code, causing the computer to start executing it when the function returns.

Security concerns aside, what about users who happen to have very long names? Clearly our approach is far from ideal. The better solution would be for the program to keep track of how full the array is and increase the size when necessary. It can’t do this in the stack frame because the array is already surrounded by other values. The process address space has an area known as the heap for just this kind of thing. The process can request (via an OS system call) chunks of memory that are allocated from the heap. This is known as dynamic allocation because the amount allocated is determined by the running process rather than being hardcoded into its instructions, as in the case of the stack.

In a language like C, which provides no memory management, it’s up to you to keep track of what’s on the stack and what you’re allocating on the heap. Getting memory management right is notoriously tricky and a huge source of bugs. It’s therefore very common for languages to offer some kind of automatic memory management. When you run a Ruby program, the Ruby runtime is in charge of deciding where to allocate variables and performing all the allocations and deallocations. Writing some simple programs in C is a good way to improve your understanding of how the stack and heap both work.

In terms of address space layout, it’s important to notice that the stack grows down and the heap grows up. The first element of the stack lives at a high memory address. As the stack grows, the addresses of the frames decrease. Both the stack and heap need space to grow dynamically as the program runs. By having one grow up and other grow down, they can both grow into the empty space in the middle. Of course, it’s still possible for the stack to extend so far downwards that it uses up all of the available space and butts up against the memory mapping segment. This is known as a stack overflow and usually happens because an incorrect use of recursion creates too many stack frames.

The random offsets you see in the diagram implement address space layout randomisation (ASLR). If the stack and heap always begin from exactly the same location, it would be easy for a malicious program to work out where particular variables (e.g. passwords) will be. Adding randomly sized offsets makes it harder for an attacking user to guess their locations.

Managing persistence

Any data that the user wants to survive a system power-off needs to be written to persistent storage. Mechanical hard disks commonly provide large capacities and relatively slow performance. In recent years solid-state disks (SSD) have proven popular as a much faster, albeit less capacious, form of persistent storage. Usually persistent storage, either mechanical or SSD, forms the bottom tier of the memory hierarchy. The storage device will typically present itself to the system as a flat expanse of space. It is up to the OS to impose structure.

The OS subsystem responsible for organising persistent storage is known as the file system. A file system groups related data into a single unit known as a file. The file system will provide some way of structuring files and expressing their relations to one another. On Unix-like systems, the OS creates a virtual file system, organised as a hierarchical tree with / at the root of the tree. Files path, which therefore always begin with / on Unix-like systems, indicate the route to a unique resource in the hierarchy. The file system is “virtual” because sub-trees may contain different concrete file systems from different storage devices.

Operating systems support many file systems offering competing balances of performance, security and reliability. Due to space constraints I’m not going to cover file systems in any detail. As ever, resources are available in further reading should you wish to study them further.

There are two points I do want to cover. Firstly, a process accesses each file through a file descriptor: a reference to a per-process data structure that records the state of the process’ interactions with the file. On a Unix-like system, every process starts with three file descriptors representing standard input, standard output and standard error. File descriptors are a limited system resource. It’s a common mistake to forget to close a file descriptor after opening. Eventually the OS will run out of file descriptors, probably causing your program to crash unless you handle this possibility. Some languages, such as Python, include constructs which automatically close a descriptor when it’s no longer in use.

Secondly, the huge speed differential between the CPU and persistent storage devices means that caching is used heavily. To improve performance, the OS will not write a change directly to persistent storage. Instead, it will update the in-memory page and only write the page to disk when the page needs to be removed from the cache. The benefit is that multiple writes to the page will be batched together and only one write to persistent storage is needed. Storage devices usually have their own internal caches, completely invisible to the OS, as a further performance improvement.

This caching can create problems for some applications, particular databases, that need to be sure that a write has actually been written to persistent storage. By default the OS will consider the write to have succeeded once the in-memory page has been updated, even though the persistent storage hasn’t yet been updated. Should a system failure occur at that point, the supposedly successful write would be lost and the database would have an incorrect understanding of what is in persistent storage. Similar problems arise if the storage device says that the write has succeeded when it has only reached the internal cache. To handle such cases where performance is secondary to reliability, OSes and storage device drivers usually provide the ability to flush writes directly to persistent storage.

Conclusion

In this chapter we saw how the operating system forms an intermediary layer between the hardware and the user’s programs. The job of the OS is to protect the hardware, present a consistent execution environment and improve performance. To achieve these objectives, OSes rely on hardware safety mechanisms provided by the processor. The kernel is a trusted, privileged program that has complete control over the system’s resources. It orchestrates and supervises the execution of untrusted, unprivileged user programs. The OS mediates all attempts to access system resources through the system call interface, supported by the interrupt mechanism in hardware.

To manage resources effectively, the OS creates virtualised representations of these resources. In each case the purpose is to abstract away the hardware specific details and provide a consistent, secure and easy-to-use interface to user programs. A process is a representation of an executing program. All computation happens within a process, allowing the OS to schedule programs to permit concurrency and optimal resource usage. Virtual memory is a virtualised process address space. It allows the OS to efficiently allocate memory while also presenting a simple, consistent address space to the process. The process address space is split into different segments. Each function call generates a stack frame. The stack therefore represents the current state of the program’s execution. The heap is used for dynamically allocated values. A file system is a representation of how data is arranged and stored in persistent storage.

We have reached an important milestone on our journey. Over the course of the first four chapters we developed our mental model of how a computing system functions. We understand every element right down to the logical bedrock. In the rest of the book we will explore what we can do with this platform.