

Compilers

Introduction

We come to the final chapter of The Computer Science Book. Compilers are well worth studying for two reasons. Firstly, understanding how your code is transformed and executed is an essential skill for high-performing developers. Secondly, truly understanding how a compiler works shows that you have a good understanding of computer science as a whole. Parsing source code draws on the automata and formal language theory we saw in theory of computation as well as the semantics of programming languages and type systems. Modelling and analysing program structures uses many of the algorithms and data structures we studied earlier. Optimising a program and generating suitable machine code requires a solid understanding of the target’s system architecture and operating system conventions. Compilers are frequently taught at the end of computer science courses because they are the perfect capstone topic.

We didn’t always have compilers. The first computer programs were written directly in machine code. Grace Hopper in the 1950s developed the A-0 system, which could convert a form of mathematical code into machine code. She thus pioneered the idea that programs could be expressed in a more natural form and mechanically translated by the computer into its own machine language. That is the idea at the core of compilation. Further work in the 1950s and 1960s led to the first high level languages such as FORTRAN, COBOL and LISP (uppercase letters were in abundance in those days). These languages abstracted away the low level hardware details and allowed the programmer to focus more on the high level architecture and logic of their programs.

Today, the vast majority of programmers rely on compilers or their close cousin, interpreters. We generally prefer to write in more expressive, high level languages and let the computer itself convert our beautiful code into the bits and bytes that the machine actually understands. It used to be the case that hand-written assembly or machine code would be better than a compiler’s output. Nowadays, optimising compilers apply a large suite of sophisticated optimisations. Their output can approach or even exceed the performance of hand-written code.

A modern, optimising compiler is a complex beast. It represents the culmination of decades of research and development. Don’t let that complexity intimidate you. In this chapter I hope to demystify the compilation process while instilling a respect and admiration for what compilers can do. Popular compilers include Microsoft’s Visual C++ compiler (MSVC) on Windows, GCC (the GNU Compiler Collection) and Clang, built on the LLVM platform. Even if you don’t regularly use these compilers, many of their ideas and techniques have been incorporated into the advanced JavaScript engines found in modern browsers.

Compilation and interpretation

Compilation and interpretation are two distinct but closely related processes. In general terms, both processes take a high level representation of a program (i.e. its source code) and prepare it for execution. A compiler takes the entire input program and converts it to a low level representation that can be executed later. An interpreter takes the program and executes it directly, incrementally converting the program as it goes. When you compile a program, you are not running the program but instead preparing it to be run later. When you interpret a program, on the other hand, you actually run the program immediately. Interpreters are therefore unable to catch bugs before the program starts. Interpreted programs have a faster start up time but generally aren’t as fast as compiled programs.

When the low level representation is raw machine code, we call the compiler output object code. Blobs of object code are then combined according to the operating system’s specifications to create an executable file, known as a binary, that is ready to load into memory and execute. This is how C and C++ approach compilation. The compiler must “target” a specific architecture and operating system so that it can use the correct conventions, instruction sets and so on. The compiled output therefore performs optimally but is tightly bound to the target architecture. Binaries are not portable: a binary compiled for a Linux system won’t work on a macOS system and vice versa. This is why you see popular programs offering different builds for various system architectures.

Another approach is for the compiler to output assembly-like bytecode that is executed by another program known as a virtual machine. This is the approach Java takes. The Java compiler converts Java source code to a simpler but semantically identical format known as Java bytecode. To run the program, we must pass the bytecode to the Java virtual machine (JVM), which can run Java bytecode. This two-step approach might seem unnecessarily complex. The purpose is to make compiled code more portable. Any Java bytecode can be executed on any system with a JVM implementation. There is no need to compile multiple versions for each supported architecture. Virtual machines are so called because they create a software-based, virtual execution environment. It is the job of the VM implementation to map the virtual machine’s semantics to the actual hardware. In doing so, it abstracts away the details of the underlying hardware and creates a consistent execution environment. Not all interpreters are virtual machines but many interpreters use virtual machines for these reasons.
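
To make bytecode and virtual machines a little more concrete, here is a toy stack-based virtual machine written in JavaScript. The instruction set (PUSH, ADD, MUL, PRINT) is invented purely for illustration; real bytecode formats such as the JVM’s are vastly richer, but the principle of a software-defined execution environment is the same.

function run(bytecode) {
  const stack = [];
  for (const [op, argument] of bytecode) {
    switch (op) {
      case "PUSH":  stack.push(argument); break;
      case "ADD":   stack.push(stack.pop() + stack.pop()); break;
      case "MUL":   stack.push(stack.pop() * stack.pop()); break;
      case "PRINT": console.log(stack.pop()); break;
      default: throw new Error(`Unknown opcode ${op}`);
    }
  }
}

// (1 + 2) * 3 compiled to our toy bytecode:
run([["PUSH", 1], ["PUSH", 2], ["ADD"], ["PUSH", 3], ["MUL"], ["PRINT"]]); // 9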

To complicate things yet further, compilation can happen at different points. The examples so far are all ahead-of-time compilation. The developer writes their code, compiles it and then distributes it in executable format (either binary or bytecode). Compilation happens before the program runs. The benefit of this approach is that the compilation work is only done once. The developer can distribute their program to many users who can directly execute the program without having to first compile it themselves. The downside of ahead-of-time compilation is that compilation can take a long time, especially if the compiler is trying to generate optimised output. If the source code changes frequently then having to regularly recompile can really harm developer productivity.

It seems that we can have either fast start up time or fast execution but not both. This is a particularly acute problem for web browsers. JavaScript is served as source code and needs to start up quickly, which would favour interpretation, but many websites expect JavaScript to handle computationally intensive tasks, which would favour optimised, compiled code. An increasingly popular solution is known as just-in-time (JIT) compilation. This attempts to square the circle by starting the program in an interpreter and then compiling the program to faster machine code as it is running. Usually the compiler doesn’t attempt to compile the entire program but instead monitors which code blocks are executed most frequently and compiles them incrementally. By observing the behaviour of the running program, the compiler might even be able to generate more optimised code than it could ahead of time. JIT compilation is common in JVM implementations and modern browser JavaScript engines.

Techniques such as JIT compilation blur the boundaries between compilation and interpretation. The water is further muddied by the fact that for performance reasons many “interpreted” languages, such as Python and Ruby, are actually compiled to bytecode and then interpreted in a virtual machine! If you find yourself getting muddled, just hold on to a simple guideline: compilation prepares a program for future execution, interpretation executes the program directly. To avoid confusion, in this chapter we’ll only consider a straightforward, ahead-of-time compiler targeting a specific architecture.

The program life-cycle

Computer programs go through a series of stages known as the program life-cycle. The earlier in the life-cycle we can detect and fix bugs, the better. When caught earlier, bugs are more likely to appear close to their source and will be easier to find and correct.

The life-cycle of a compiled program

Edit time is when the developer is writing or modifying the program’s source code. Tools like linters and code formatters are able to statically analyse the source code and detect some types of errors. Here “static” means that the tool doesn’t actually execute the program. It merely analyses the source code.

The program enters compile time when it is being compiled, naturally enough. The compiler performs its own static analysis to ensure that the input is syntactically valid and well-typed. The compiler operates on chunks of code known as translation units. Depending on the language’s semantics, a translation unit might be an individual file, a class or a module. As we’ll cover in detail below, the compiler builds up an in-memory representation of the input’s structure and semantics and generates suitable output code (e.g. object code, bytecode). Depending on the semantics of the language (particularly its type system), the compiler may be able to detect many bugs at this stage including invalid syntax, incorrect types or even more subtle bugs such as non-exhaustive handling of case statements.

By itself, object code can’t be directly executed. Each blob of object code is just a sequence of machine code instructions implementing the required functionality. If the code references other functions, global variables and so on, the executable code will need to know how to find them. For example, if I import a function from another translation unit, the object code will contain references to the function but will not have the function’s code or the address of its implementation. At link time a program known as the linker combines the individual blobs of object code into one coherent, executable bundle. It lays out the object code blobs and fills in the missing addresses. The linker is often a separate program, but it is invoked automatically by the compiler. Linking is a fairly mechanical process, so we won’t look at it in great detail.

There are different ways of handling linking. Static linking is where the linker copies library functions into the binary. The Go compiler statically links the entire Go runtime into every binary. This leads to bigger binary sizes but makes it easier to deploy the binary because it’s self-contained. Alternatively, dynamic linking defers linking until runtime. The linker relies on the OS to load the necessary libraries into memory along with the program’s binary code. A common example of this approach is the C standard library. Since the C standard library is so frequently used, it makes more sense for the OS to make one copy accessible to all programs than to duplicate it in every C binary. Dynamic linking saves on binary sizes and means that improvements in a linked library are immediately available to every program linking to that library. However, it can lead to confusing portability issues when a program is compiled against one version of the library and another one is available at runtime.

Finally, runtime is when the program is actually loaded into memory and executing. Errors occurring now need to be handled gracefully in some way or they could cause the whole program to crash. Even worse, an error might not cause a crash but instead cause incorrect behaviour, possibly propagating far from the original site of the bug. This is not ideal for anyone. The runtime environment is the available functionality not directly provided by the running program. The richness of the runtime environment depends on the language and how the program is executed. At one extreme, the runtime environment might be an entire virtual machine and its interface to the physical system. Go’s runtime provides a lot of complex functionality such as automatic memory management and cooperative goroutine scheduling. C, at the other extreme, is pretty lazy and leaves all of that work to the programmer. The C runtime is therefore just a standard library of useful functions and a little bit of platform-specific code to handle passing control to and from the kernel.

Building a compiler

It’s tempting to think of compilers as somehow magical programs that reach deep into the depths of computation itself in order to give life to our squalid source code. In reality they work just the same as any other program: they take input, process it and generate output. What is special about compilers is that their input and output are other programs. High level source code goes in one end and machine code comes out the other.

As ever in computing (and life!), when we face a complex task it is helpful to break it down into simpler sub-tasks. Compilers are generally structured as a pipeline of processing steps. Each step is responsible for a different part of the compilation process. The front end of the pipeline works directly with the incoming source code. It uses its understanding of the language’s syntax and semantics to generate a language-agnostic, compiler-specific intermediate representation (IR) of the program. The front end encodes into the IR all of the knowledge it can glean from the source code. The middle end (or optimiser) takes the IR from the front end and applies a sequence of optimisations. This is where a lot of the super cool stuff goes on. Each optimisation involves analysing the IR to find areas ripe for optimisation and then manipulating the IR into a more efficient form. The back end is responsible for converting the optimised IR into target-specific machine code.

The compiler pipeline

Each pipeline step should ideally be independent of the others, communicating only by passing IR to later steps. The motivation for this architecture is not just a clean separation of concerns. Code reuse is also extremely important. Imagine that you want to compile m languages on n architectures. If each language-architecture pair needed its own compiler, you’d need to write m * n compilers. The pipeline approach avoids lots of this work. We only need to write a front end for each language and a back end for each architecture, so m + n in total. The optimising middle end can be reused by every language-architecture pair.

Avoiding m * n compilers by using IR

The LLVM compiler project is a great example of this approach. It is structured as a low level, compiling virtual machine with back ends for many targets. Using LLVM is an easy way for language designers to have their language supported on a wide range of platforms. They only need to write a single front end that compiles their language into LLVM IR to gain access to all of the optimisations and targets supported by LLVM. Innovations and improvements to LLVM benefit all languages using it.

The front end

The front end converts source code into the compiler’s IR.

The front end pipeline

The input will be a flat sequence of characters in a file. The front end first splits the character sequence into its constituent parts in a process known as lexical analysis. It then parses the input to figure out its structure. Finally it performs semantic analysis to check that the program structure makes sense according to the language’s semantics. All of the information captured by the front end is encoded into IR.

Throughout this section, I’ll demonstrate how these processes work on a fragment of JSON:

1 { "names": ["Tom", "Tim", "Tam"] }

Decoding a JSON string into an object is similar to parsing source code into IR.

Lexical analysis

The front end’s work begins with lexical analysis (also known as lexing or scanning), which is performed by the lexer. The source code input will be a long sequence of characters with no clear sub-divisions or structure. Lexical analysis is the process of breaking the input into individually meaningful units known as tokens. A token is a pair of a lexical category and an optional attribute, which usually holds the lexeme: the actual text matched from the character sequence. The lexer performs tokenisation by going through the character sequence and matching the input against a list of token patterns. The output is a sequence of tokens. Here’s how our JSON example might look tokenised:

LBRACE, { STRING, names }, COLON, LBRACKET, { STRING, Tom }, COMMA, { STRING, Tim }, COMMA, { STRING, Tam }, RBRACKET, RBRACE

{ STRING, Tam } indicates a STRING token with the attribute Tam. Note that the double quotes in the input are not present in the tokenised output. Their only purpose is to delimit strings and so once we tokenise the input, we don’t need them any more. Most of the tokens don’t need an attribute because the token type conveys all of the necessary information e.g. LBRACE. Another valid approach would be to have a single PUNCTUATION token specialised via the attribute e.g. { PUNCTUATION, ] }.

The reference to “pattern matching” above should tip you off that tokens use regular expressions to specify the lexemes they match. Sometimes they’re pretty obvious. The pattern for the LBRACKET token is just /\[/ (the bracket must be escaped because it has special meaning in regexes). Whenever the lexer matches a left bracket, it outputs an LBRACKET token. The patterns for identifiers and numbers can be more complicated, particularly when numbers can have negative signs, decimals and exponents. A programming language’s lexical grammar will define the rules for matching every type of token. Here’s a simple example of how C identifiers are defined:

1 letter     = [a-zA-Z_]
2 digit      = [0-9]
3 identifier = letter(letter|digit)*

A valid C identifier is any combination of letters (including underscore) and digits, provided it begins with a letter. As we know from the theory of computation, regexes can be encoded as finite automata. Lexers are often implemented by defining each token’s match pattern as a regex and then using tools called “lexer generators” to automatically combine each regex’s finite automaton into a single, deterministic finite automaton that matches every token. This is relatively easy to do programmatically because regular expressions are computationally simple. The regular expression pattern for the C identifier rules above would be: [a-zA-Z_][a-zA-Z0-9_]*. You can see clearly how it is constructed from the constituent lexical rules.
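
To make this concrete, here is a minimal, hand-written lexer for our JSON fragment in JavaScript. It is only a sketch: the token names follow the examples above, the patterns are simplified (no escape sequences, negative numbers or exponents) and a real lexer generator would combine the patterns into a single automaton rather than trying them one by one.

// Each rule pairs a lexical category with a pattern. STRING and NUMBER
// extract an attribute (the lexeme); punctuation tokens don't need one.
const TOKEN_PATTERNS = [
  { type: "WHITESPACE", pattern: /^\s+/, ignore: true },
  { type: "LBRACE",     pattern: /^\{/ },
  { type: "RBRACE",     pattern: /^\}/ },
  { type: "LBRACKET",   pattern: /^\[/ },
  { type: "RBRACKET",   pattern: /^\]/ },
  { type: "COLON",      pattern: /^:/ },
  { type: "COMMA",      pattern: /^,/ },
  { type: "STRING",     pattern: /^"([^"]*)"/, attribute: match => match[1] },
  { type: "NUMBER",     pattern: /^(0|[1-9][0-9]*)/, attribute: match => match[0] },
];

function tokenise(input) {
  const tokens = [];
  while (input.length > 0) {
    const rule = TOKEN_PATTERNS.find(r => r.pattern.test(input));
    if (!rule) {
      throw new Error(`Unexpected character: ${input[0]}`);
    }
    const match = input.match(rule.pattern);
    if (!rule.ignore) {
      tokens.push({
        type: rule.type,
        attribute: rule.attribute ? rule.attribute(match) : null,
      });
    }
    input = input.slice(match[0].length);
  }
  return tokens;
}

// tokenise('{ "names": ["Tom", "Tim", "Tam"] }') produces the token sequence
// shown above: LBRACE, { STRING, names }, COLON, LBRACKET, ...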

The simplicity of regular languages brings limitations. Recall from our theory of computation discussion that a finite automaton must have a limited number of states. It is not possible to create regular expressions that can “remember” how many times they’ve seen a character. When checking for syntactically valid code, it is very common to need to do this. For example, a common rule is that every opening brace must be followed by a corresponding closing brace. The lexer can catch simple errors such as an invalid identifier but it cannot catch these more complex errors. Had I left off the closing brace in the JSON example above, the lexer would have just generated the same token sequence minus the final RBRACE. More sophisticated analysis requires a more powerful tool.

Syntax analysis

The lexer has generated a sequence of meaningful tokens. The parser takes the token sequence and uses the language’s syntax rules to recreate the structure of the program. The parser either reports the syntax errors it finds or generates IR for consumption by subsequent stages. This is the last point at which the compiler works directly with the source code.

The syntax rules are normally defined in a language’s specification as a grammar: a collection of production rules. Each rule describes how tokens may be combined to produce a syntactically valid statement. The syntax rules for programming languages can become quite complex. Designing good syntax is a balancing act between making code readable for humans and making it easily parseable for the compiler. Rules are often defined in Backus-Naur Form (BNF). A simplified grammar for JSON is below. I’ve used an extended form of BNF that includes regex-style syntax for expressing repetition and optionality.

1 json   ::= value
2 value  ::= string | number | object | array | 'true' | 'false' | 'null'
3 
4 string ::= [a-zA-Z_]*
5 number ::= 0 | [1-9] [0-9]*
6 
7 pair   ::= string ':' value
8 object ::= '{' pair (',' pair)* '}' | '{' '}'
9 array  ::= '[' value (',' value)* ']' | '[' ']'

The value to the left of each ::= is known as a term. The definition of the term is on the right. A term that appears on the left of a ::= and can be expanded further is known as a non-terminal (try saying that ten times quickly); the literal values that cannot be expanded any further are terminals. Thus, pair is a non-terminal and its definition refers to the string and value non-terminals as well as the ':' terminal. The major difference from the lexical grammar above is that BNF rules can refer to each other, and even to themselves, which allows them to “remember” matching elements. The definition of array allows an arbitrarily long, arbitrarily nested list of values followed by a closing bracket. That would not be possible to express with a regex.

Before moving on to how parsing actually works, I want to focus for a moment on the relationship between language and grammar. Recall from the theory of computation that the set of valid inputs to an automaton is known as the automaton’s “language”. We also know that a grammar defines how valid statements in the language can be constructed. The three concepts of automaton, language and grammar are closely linked: each type of automaton accepts (or recognises) the set of languages defined by a particular type of grammar. This is what we want our parser to do. We want some way to verify that the input token sequence conforms to the syntax rules of the language. The relationship between grammars, automata and languages is captured by the Chomsky hierarchy of formal languages and their capabilities:

The Chomsky hierarchy

Each grammar is more powerful than the ones below it. We know that regular expressions construct finite automata that accept regular languages. Push-down automata (PDA) are more powerful than finite automata, which means that they can implement more complex grammar rules and accept a bigger set of languages. We know from our previous study of push-down automata that they can check for things like balanced parentheses, which regexes can’t do. Grammars that can be implemented with push-down automata are known as context-free grammars. And of course, a Turing machine can accept anything that the less powerful automata can.

We won’t talk too much about context-sensitive grammars and linear bounded automata, which are a restricted form of Turing machine. The real trouble starts with recursively enumerable grammars, which require a full Turing machine to parse. The simple act of parsing a program would then entail arbitrary computation and, due to the halting problem, would not be guaranteed to terminate. This does sometimes actually happen, usually by accident. In C++, it is possible to write a program that is only syntactically correct if a particular number is prime (see further reading).

Is our JSON grammar context free? A context-free grammar requires that each production rule consist of a single nonterminal on the left of the ::= and a sequence of terminals and/or non-terminals on the right. You can see that every rule in our JSON grammar meets this definition. Therefore we have a context-free grammar. The appeal of context-free grammars is that they afford enough power and flexibility to write sophisticated syntax rules but are simple enough to be efficiently parseable. Parsers make use of a stack data structure to hold the parsing state, which is of course a concrete implementation of a push-down automaton.

The purpose of parsing is to take the flat sequence of tokens and turn them into something that reflects the structure of the program. A tree is a suitable data structure because the statements are hierarchical and nested. Parsing our JSON example produces a parse tree that would look like this:

 1             json
 2               |
 3             value
 4               |
 5            object
 6               |
 7          '{' pair '}'
 8               |
 9  string      ':'      value
10    |                    |
11 "names"               array
12                         |
13         '[' value ',' value ',' value ']'
14               |         |         |
15             string   string     string
16               |         |         |
17             "Tom"     "Tim"     "Tam"

Note that I’ve used punctuation symbols, rather than LBRACE and so on, purely to make the tree easier to read. Parsers construct the parse tree by consuming the input token by token and trying to match the observed sequence to production rules. There are two main approaches within this model. One is to work top-down from the highest level of the parse tree to the lowest. Top-down parsers commonly parse the input from Left to right using Leftmost derivation (we’ll cover this shortly) and so are also known as LL parsers. The other approach is to work bottom-up by starting at the lowest level of the parse tree and constructing upwards. Bottom-up parsers commonly work from Left to right using Rightmost derivation and so are also known as LR parsers. I’ll use LL and LR from now on, just for concision.

So what does top-down, leftmost derivation actually mean? The LL parser iterates through the input token sequence. At each step, the parser can read the current token and the top of the stack. Based on those two values, it selects a production rule and, working from left to right, expands the rule’s non-terminal symbols until it reaches the terminal values. When the current token matches the predicted token, it is consumed and the parser moves to the next token. Let’s step through a section of the parsing process. Remember we are parsing the token sequence:

LBRACE, { STRING, names }, COLON, LBRACKET, { STRING, Tom }, COMMA, { STRING, Tim }, COMMA, { STRING, Tam }, RBRACKET, RBRACE

1. The parser begins with LBRACE and an empty stack.
2. Our starting rule is json ::= value.
3. value is a non-terminal so we must expand it. We have several possibilities but the only one that can start with LBRACE is object. We expand the rule to json ::= object by pushing object on to the stack.
4. json ::= object expands to json ::= '{' pair '}' by popping object off the stack and pushing '}', pair and '{'.
5. We match the left brace at the top of the stack to the input LBRACE. We pop it off the stack and move to the next input token.
6. Now we must expand pair to json ::= '{' string ':' value '}' by popping off pair and pushing value, ':' and string on to the stack.
7. We match first the string and then the colon to the input tokens, popping each one off the stack and moving through the input sequence.
8. Now we have a value at the top of the stack and LBRACKET as the current token. The only matching rule is array.

Expanding the array is left as an exercise to the reader. Note that the right brace from object would remain on the stack until after the array had been fully parsed. That is how the parser keeps track of symbols it expects to see later. In this example, I was always able to choose the correct rule by looking solely at the current token and the top of the stack. Depending on how the rules are formulated, just one token of lookahead might not be sufficient to unambiguously select the next rule. An LL parser that can peek ahead by k tokens is known as an LL(k) parser.

You might have wondered exactly how I chose each rule correctly. A common way to implement LL parsers is to automatically generate from the production rules a parsing table that unambiguously specifies which rule to choose for each token and top of stack pair. The benefit of using a table-based construction is that the parser can be modified to parse a different language simply by changing the table.
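
As an illustration, a fragment of such a parsing table for our JSON grammar might look like the following nested JavaScript lookup. It is deliberately simplified: the empty-object and empty-array alternatives and the comma-separated repetition are left out.

// table[nonTerminal][lookaheadToken] gives the production to expand,
// written as the list of symbols to push on to the stack.
const table = {
  value: {
    LBRACE:   ["object"],
    LBRACKET: ["array"],
    STRING:   ["string"],
    NUMBER:   ["number"],
  },
  object: {
    LBRACE: ["LBRACE", "pair", "RBRACE"],  // object ::= '{' pair '}'
  },
  pair: {
    STRING: ["string", "COLON", "value"],  // pair ::= string ':' value
  },
};

// With value on top of the stack and LBRACE as the current token, the parser
// looks up table.value.LBRACE and replaces value with object on the stack.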

Another method of constructing top-down parsers is called recursive descent. Instead of using an automatically generated table, recursive descent parsers are usually written by hand as a collection of mutually recursive functions. An explicit stack is not required because the call stack does the same job of keeping track of in-progress rules. Each function matches a single production rule. If the rule contains non-terminals, the function will call other functions representing the non-terminal rules:

 1 function expect(token) {
 2   const current = getToken();
 3   if (current !== token) {
 4     throw `Expected ${token}, got ${current}`;
 5   }
 6 }
 7 
 8 function matchPair() {
 9   matchString();
10   expect(':');
11   matchValue();
12 }

Recursive descent parsers are surprisingly straightforward and fun to implement. Writing your own one is a great way to get a better understanding of how parsing works.
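
To show what that looks like in full, here is a small, self-contained recursive descent parser sketch in JavaScript. It consumes a token sequence like the one shown earlier (represented here as objects with a type and an optional attribute) and, rather than building an explicit parse tree, decodes the JSON straight into a JavaScript value. The function names mirror the grammar’s production rules.

function parse(tokens) {
  let position = 0;

  // Look at the current token's type without consuming it.
  const peek = () => tokens[position] && tokens[position].type;

  // Consume the current token, checking it is what the grammar expects.
  function expect(type) {
    const current = tokens[position++];
    if (!current || current.type !== type) {
      throw new Error(`Expected ${type}, got ${current && current.type}`);
    }
    return current;
  }

  function matchValue() {
    switch (peek()) {
      case "LBRACE":   return matchObject();
      case "LBRACKET": return matchArray();
      case "STRING":   return expect("STRING").attribute;
      default: throw new Error(`Unexpected token ${peek()}`);
    }
  }

  function matchObject() {
    expect("LBRACE");
    const object = {};
    while (peek() !== "RBRACE") {
      const key = expect("STRING").attribute;
      expect("COLON");
      object[key] = matchValue();
      if (peek() === "COMMA") expect("COMMA");
    }
    expect("RBRACE");
    return object;
  }

  function matchArray() {
    expect("LBRACKET");
    const values = [];
    while (peek() !== "RBRACKET") {
      values.push(matchValue());
      if (peek() === "COMMA") expect("COMMA");
    }
    expect("RBRACKET");
    return values;
  }

  return matchValue();
}

// Given the token sequence for our example, parse returns
// { names: ["Tom", "Tim", "Tam"] }.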

LR parsing works slightly differently. Consuming the input sequence token by token is known as shifting. On each shift, the parser creates a new, single-node parse tree containing the current token. The parser remembers previously parsed items. When it recognises that it has matched a whole production rule, the parser reduces the elements. It creates a new node for the matched rule, pops all the matched elements off the stack, attaches them as children to the new node and pushes the new node on to the stack. Multiple elements are reduced to a single value. An LR(k) parser decides which operation to perform based on the current value, previously seen values and k tokens of lookahead. Let’s step through parsing our JSON token sequence with an LR parser:

LBRACE, { STRING, names }, COLON, LBRACKET, { STRING, Tom }, COMMA, { STRING, Tim }, COMMA, { STRING, Tam }, RBRACKET, RBRACE

1. We shift to LBRACE and create a parse tree node for it. We don’t yet know which rule it is part of so we push it on to the stack.
2. Repeat the same operation for each token as far as the LBRACKET.
3. Based on the fact that we have just seen an LBRACKET and have a COMMA just ahead, we are in an array rule. Reduce the string to a value.
4. Repeat for the next two STRING values.
5. We have previously seen an LBRACKET and a list of comma-separated values. The current RBRACKET completes the array rule match. Reduce the brackets and comma-separated values to an array. We have now constructed the bottom right section of the parse tree.
6. By reducing the array to a value, we can then reduce the previously seen STRING and COLON into a pair.
7. Shifting again, we find an RBRACE. We can reduce the LBRACE, pair and RBRACE into an object.
8. Reduce object into json.

LR parsers take a bottom-up approach, incrementally combining terminals into rules and those rules into larger, non-terminal rules until they reach the root term at the top of the tree. Values the parser doesn’t yet know how to deal with are pushed on to a stack for use later. LR parsers are more powerful than LL parsers, in that they can accept a wider range of grammars, but they are also harder to write by hand and so are nearly always auto-generated by programs such as yacc and bison.

Whereas an LL parser starts constructing the expected rule while it is still in the process of matching it, an LR parser waits until it has seen every part of a rule before committing to it. You might find it helpful to imagine how you would traverse the parse tree in the order that each parser creates the nodes. An LL parser corresponds to a pre-order tree traversal: it visits the parent node, then the left branch and then the right. An LR parser corresponds to a post-order tree traversal: it visits the left branch, then the right branch and then the parent node.

Parsers, whether LL or LR, work best when they can always deterministically choose the correct rule to apply. If the parser chooses an incorrect one, it might continue to make progress but will eventually hit an error state when it encounters an input token it cannot match to any rule. It must then backtrack through the token sequence, select another candidate rule and discard all the work from parsing the incorrect rule. If the parser has to backtrack frequently, this has a hugely detrimental impact on performance. Programming languages are therefore usually designed to be deterministically context-free, which means that they can be efficiently parsed by an LR(1) parser (with just a single token of lookahead). Thanks to this, parsing is nowadays seen as a more or less solved problem for compilers.

Semantic analysis

The final step of the front end is to perform a semantic analysis. So far the compiler has only looked at whether the input is syntactically valid. As we’ve seen, parsing is a fairly mechanical process that can determine structure but not meaning. For example, the English sentence “red wind thinks abruptly” is syntactically correct but semantically meaningless.

Type checking is one very common way of checking that the operations in the parse tree are semantically meaningful. The compiler must analyse the program and assign a type to each name and each expression. It must check these types to ensure that they are used in contexts where they are correct according to the language’s type system. Depending on the strength of the type system, the compiler might be able to detect many errors. For example, a + b might be a syntactically correct expression but semantically incorrect (i.e. a type error) if a is a string and b is a number.

In languages that support operator overloading, semantic analysis is required to determine the correct operator. “Operator overloading” means that operators such as + may have multiple meanings depending on the type of their operands. In Java, 1 + 2 is addition but "hello " + "world" is concatenation. Parsing alone cannot tell the compiler which operation + represents. It must consider the semantic meaning of the expression in order to generate the correct IR.

At this stage it is common for compilers to create secondary data structures, separate from the IR, to hold useful bits of data. An example is a symbol table recording every declared identifier and all the information the compiler has managed to find out about it. The exact attributes vary from language to language but will probably include things such as identifier name, type, scope and so on. If the language specification requires that a variable be declared before it is used (which is impossible to express in BNF), the compiler can easily enforce this by checking that every referenced identifier already has an entry in the symbol table.
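
Here is a sketch of what such a symbol table might look like, with support for nested scopes. The attributes stored for each identifier are illustrative; a real compiler records whatever its language’s semantics require.

class SymbolTable {
  constructor(parent = null) {
    this.parent = parent;      // enclosing scope, or null for the global scope
    this.symbols = new Map();
  }

  declare(name, attributes) {
    if (this.symbols.has(name)) {
      throw new Error(`${name} is already declared in this scope`);
    }
    this.symbols.set(name, attributes);  // e.g. { type: "number", scope: "local" }
  }

  // Look a name up in this scope, then in the enclosing scopes.
  resolve(name) {
    if (this.symbols.has(name)) return this.symbols.get(name);
    if (this.parent) return this.parent.resolve(name);
    throw new Error(`${name} was used before being declared`);
  }
}

// const globals = new SymbolTable();
// globals.declare("radius", { type: "number" });
// const functionScope = new SymbolTable(globals);
// functionScope.resolve("radius"); // found in the enclosing scope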

Intermediate representations

Intermediate representation (IR) is an internal, compiler-specific data structure or language in which the front end encodes all of the information it has derived from the source code. It doesn’t contain any new information. It is merely the input program re-expressed in a format that is easier for the middle and back ends to process. There are two main varieties: the abstract syntax tree (AST) and linear IR.

The parse tree, also known as a concrete syntax tree, contains lots of syntactic cruft that’s not really important. Once we know the structure, there’s not really much point holding on to syntax-specific details such as the brackets delimiting arrays. An abstract syntax tree (AST) is a simplification of the parse tree that reflects the structure of the program without unnecessary details. The AST of our JSON example might look like the following:

1             json
2              |
3            object
4              |
5  string             array
6    |                  |
7 "names"   string   string   string
8              |        |        |
9            "Tom"    "Tim"    "Tam"
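
In memory, the compiler might represent that AST as something like the following tagged structure. The node types and field names here are purely illustrative.

const ast = {
  type: "object",
  pairs: [
    {
      key: { type: "string", value: "names" },
      value: {
        type: "array",
        elements: [
          { type: "string", value: "Tom" },
          { type: "string", value: "Tim" },
          { type: "string", value: "Tam" },
        ],
      },
    },
  ],
};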

Alternatively, if the compiler uses a virtual machine internally (e.g. LLVM) it’s common for the front end to generate linear IR in the form of bytecode for the virtual machine. The format of bytecode is generally a compact, simplified assembly. Though the bytecode looks like assembly, it is still “abstract” in the sense that it does not map directly to the target architecture’s ISA. The compiler back end is responsible for later translating bytecode to whatever sequence of machine instructions is required.

Linear IR is often expressed in static single-assignment form (SSA), where every variable is defined before it is used and is only assigned once. Reassignments in the source code are re-expressed as assignments to uniquely-indexed variables. This sounds unnecessarily complex but it actually makes things much clearer. Giving every assignment its own name makes it very clear which variables are in use. In the following example, the compiler would have to perform analysis to determine that the first assignment isn’t used:

1 name := "Tom"
2 name := "Tom Johnson"
3 full_name := name

Giving each assignment its own name makes it trivially obvious to the compiler that name1 plays no part in determining full_name1:

1 name1 := "Tom"
2 name2 := "Tom Johnson"
3 full_name1 := name2

Middle end / optimiser

The job of the middle end, also known as the optimiser, is to take the IR generated by the front end, analyse its content and transform it to perform the same computation in a more efficient way. Efficiency may be measured in terms of space (the number of instructions) or execution time (op count * cycles per op). The objective is a smaller, faster program that does the same thing with fewer resources.

Optimisation is where compilers can really set themselves apart from the competition. Selecting effective optimisations is something of an art and requires a deep understanding of the trade-offs involved. For example, modern processor performance is very sensitive to efficient cache utilisation and so minimising cache misses for both instructions and data is very important. An optimisation that generates faster but more verbose machine code might perform better in isolation but, when applied to a real program, the more verbose machine code might push important values out of the cache and so lead to worse performance overall. In this section I’ll illustrate a few common optimisations by showing equivalently optimised source code examples. Remember that optimisations are actually performed on the IR. Don’t try and manually optimise your source code.

An optimisation is a combination of analysis to identify inefficiencies and a transformation to remove the inefficiency. At this compilation stage, the optimisations are independent of the target architecture. Target-specific optimisations are performed in the back end. To be an effective optimisation, the compiler needs to get a good return on the time it invests in analysis. An optimisation is impractical if the analysis takes too long. Many optimisation problems are NP-complete or even undecidable if you want to identify absolutely every optimisation opportunity. Compilers therefore use heuristics to get approximate answers. The analysis algorithm only needs to be “good enough” to find a reasonable number of inefficiencies in a reasonably short time. An optimisation must be applicable to any program and ideally affect “hot” parts of the code that are executed frequently, thereby producing a more noticeable improvement for the analysis time invested.

Compilers construct supplementary data structures to help them analyse the program. A basic block is a sequence of operations that always execute together e.g. a function body with no control flow statements. Control enters the block at the first operation and always leaves at the last. A control flow graph (CFG) is a directed graph in which basic blocks are the nodes and possible control flow transfers are the edges. CFGs are prerequisites for many useful optimisations.

A control-flow graph. Basic blocks are connected via control flow transfers
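
As a sketch, a compiler might represent a tiny CFG as a map from basic blocks to their operations and successors. The block names, operations and field names here are illustrative only.

const cfg = {
  entry:     { ops: ["x := read()", "cond := x > 0"], successors: ["thenBlock", "elseBlock"] },
  thenBlock: { ops: ["y := x * 2"],                   successors: ["exit"] },
  elseBlock: { ops: ["y := 0 - x"],                   successors: ["exit"] },
  exit:      { ops: ["return y"],                     successors: [] },
};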

There are many, many possible optimisations. Optimising compilers come with whole suites of optimisations and generally allow the programmer to choose the optimisation aggressiveness via command line flags. Optimisations are performed in a sequence. We’ll start by looking at optimisations that act within a basic block and gradually increase our scope to the whole program. Please don’t think you need to memorise all of these optimisations or start writing your code according to the examples. I merely want to demonstrate how something that sounds as magical as “optimising a program” is nothing more than the accumulation of relatively simple techniques.

Local optimisations

Analysis and transformations for local optimisations all occur within a single basic block. No control flow needs to be considered, aiding efficient analysis.

Local common subexpression elimination detects when the same thing is computed multiple times within the same basic block:

1 shifted := math.Pi * freq + 100
2 scaled := math.Pi * freq * 0.5

It will most likely be faster to compute the value once, store it in a register and reuse it:

1 tmp := math.Pi * freq
2 shifted := tmp + 100
3 scaled := tmp * 0.5

A related optimisation is local constant folding. If a value doesn’t change at runtime, the compiler can compute it at compile time and insert the constant value into the compiled code:

1 const MEGA_BYTE = 1024 * 1024

Why do it at runtime, repeating the work every single time the program runs, when the compiler has all the information it needs to do it now?

1 const MEGA_BYTE = 1048576

Strength reduction means replacing slow, complex operations with simpler, faster ones. This can be performed at multiple levels of analysis. Within a basic block, the compiler might replace multiplication and division by powers of two with bit-shifting operations. Division in particular is slow on many processors. Dividing by 2^n is equivalent to shifting n bits to the right. For example, 28 / 4 is the same as shifting two bits right:

1 0b00011100 // 28
2 0b00000111 // 7
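
You can check the equivalence yourself (here in JavaScript rather than in a compiler’s IR). Note that a compiler only applies this transformation when it can prove it is safe, for example for unsigned values, because shifting and division disagree about rounding for negative numbers.

const n = 28;
console.log(n >> 2);  // 7
console.log(n / 4);   // 7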

Dead code elimination deletes code that contributes nothing to the computation. It’s semantically meaningless and serves no useful purpose:

1 function deadCodeCalc(num) {
2   const a = 200 * 4 * num;
3   const warningState = a > (1024 / a * 2);
4   return a;
5 }

warningState is local to the function and so can’t affect anything outside of the function. It doesn’t do anything inside the function either so the compiler can remove it without changing the program’s behaviour. A dead variable plays no role in the computation.

Loading and storing values comes at a cost, particularly if any of the values are held in memory (remember memory accesses are comparatively slow). Detecting redundant loads and stores can eliminate this cost, or at least make it more likely that the essential values can all fit in the processor’s registers.

1 initial := x + 5
2 copy := initial
3 copy2 := copy
4 res := copy2 * 5

In this admittedly contrived example, the intermediate variables can be optimised away:

1 res := (x + 5) * 5

Of course, the compiler needs to be sure that copy and copy2 aren’t used anywhere else.

Global optimisations

These optimisations occur within a CFG. That means the analysis must consider behaviour across basic blocks. These optimisations might be more analytically expensive but can offer greater efficiency gains than local optimisations.

Many local optimisations can be performed at a global level. For example, the compiler might determine that the expression in an if statement always evaluates to true. Since that branch will always be taken, the compiler can eliminate the check and the entire else branch. That might eliminate whole chunks of the CFG if the compiler can further determine that the only way to access those code paths was through the initial else branch.

1 def dead_code_calc(number)
2   a = 200 * 4 * number
3   return a
4   1000.times { |n| n * n }
5 end

In this function, the code after return can never possibly be reached. Cutting out code creates opportunities for further optimisation. For example, if number is always the same, a later optimisation might replace the whole call to dead_code_calc(number) with a constant.

Of particular interest are loop optimisations because loops are usually hot paths. Even a small optimisation in a loop can have a huge impact over many iterations. Loop unrolling involves rewriting a loop to perform more steps in each iteration. In this basic C example, we are summing an array of numbers:

1 for (int i = 0; i < n; i++) {
2   sum += data[i];
3 }

To perform n additions we need n iterations. An optimisation might transform the code into this:

1 // assume n is a multiple of four and sum_0 to sum_3 start at zero
2 for (int i = 0; i < n; i += 4) {
3   sum_0 += data[i + 0];
4   sum_1 += data[i + 1];
5   sum_2 += data[i + 2];
6   sum_3 += data[i + 3];
7 }
8 sum = sum_0 + sum_1 + sum_2 + sum_3;

In this version, we only need n / 4 iterations to perform n additions. We have reduced the iteration overhead. Depending on the target processor’s support for parallelisation, it might even be possible to perform all four additions in parallel. In the unoptimised example, each addition depends on the value of the previous one and so cannot be parallelised at all.

Loop unrolling reduces data dependencies, enabling parallel computation

Loop-invariant code motion moves operations outside of a loop without changing the loop’s semantics. That means less work per loop iteration.

1 res := 10.0
2 for t := 0.0; t < cycles * 2 * math.Pi; t += res {
3     // ...
4 }

Assuming that cycles doesn’t change within the loop body, there is no point recomputing cycles * 2 * math.Pi every iteration. The compiler will move it outside of the loop:

1 res := 10.0
2 limit := cycles * 2 * math.Pi
3 for t := 0.0; t < limit; t += res {
4     // ...
5 }

If it can determine that cycles is constant, this would also be an opportunity for constant folding. It’s important that the compiler knows for sure that the value won’t change (i.e. that it is invariant). You might think that a clever C compiler would move strlen(input) out of this loop:

1 for (int i = 0; i < strlen(input); i++) {
2     // ...
3 }

In fact, due to the semantics of C, it is very difficult for the compiler to guarantee that input won’t change between iterations and so it would most likely call strlen(input) on every iteration.

Inter-procedural optimisations

Some analysis can only be performed on the program as a whole. This is even more expensive than global (i.e. function-level) analysis but offers even greater optimisation benefits.

In any architecture, calling a function will entail some overhead. There will be some kind of CALL operation, arguments need to be moved into registers and a stack frame needs to be created. At the end there’ll be a RET operation, the return value needs to be written into a register and the stack needs to be updated. Function inlining seeks to avoid all that overhead by replacing a function call with the actual code from the function body.

Here’s an example of a function that computes the area of a triangle. Assume that height can be negative to indicate a downwards-pointing triangle. To compute the area we need the absolute value of the height:

1 function abs(number) {
2   return number >= 0 ? number : -number;
3 }
4 
5 function triangleArea(width, height) {
6   return width * abs(height) / 2;
7 }

Making a separate call to abs is a lot of overhead for a simple conditional. The compiler can replace abs(height) with the function body:

1 function triangleArea(width, height) {
2   return width * (height >= 0 ? height : -height) / 2;
3 }

Moving code directly into the function creates further local optimisation opportunities by allowing the compiler to see more of what is going on without having to cross function boundaries.

Remember how I talked about caching earlier? The compiler needs to be careful when performing function inlining. The whole point of a function, of course, is to reuse code and reduce duplication. Each inlining creates another copy of the function’s instructions and so increases the size of the program. Function inlining is therefore most suitable for very small functions that are called frequently.

The optimiser essentially refactors the IR to make it more efficient. Lots of small optimisations can build on each other to produce dramatically faster or smaller output. To reiterate my earlier advice, always let the compiler do the work of analysing and rewriting your code as it sees best. It will almost certainly do a much better job. When you first learn about optimisations and code efficiency, it’s easy to start stressing about tiny details. Focus on writing clean, correct code and leave the optimisation to the compiler.

Back end

We come to the end of the pipeline. The back end takes the optimised IR and generates output suitable for the compilation target. This involves generating the actual machine code, allocating registers and applying some target-specific optimisations.

The compiler back end pipeline

Instruction selection

The first step is to generate machine code for the target architecture. The compiler identifies the various constructs in the IR and outputs corresponding machine code. This sounds complicated but it’s conceptually quite simple. Imagine that you are compiling a Python function into JavaScript (technically compiling from one language to another is known as transpiling). The function looks like this:

1 def circle_area(radius):
2     return math.pi * radius ** 2

After parsing, you’ll have some IR showing that a function named circle_area takes an argument called radius and returns its square multiplied by pi. All you need to do is convert each element of the IR to JavaScript. You know how to define a function in JavaScript:

1 function FUNC_NAME(ARG_LIST) {
2     FUNC_BODY
3 }

It’s pretty trivial to take the details from the IR and plug them into the template:

1 function circle_area(radius) {
2     return Math.PI * radius ** 2
3 }

The principle is the same for machine code. When two values are added, output an ADD $1, $2 instruction. When there is a conditional, output the comparison followed by a suitable conditional jump instruction and arrange the code branches correctly. The back end knows how each IR element can be expressed in the target architecture’s machine code. As a simplification, the back end usually assumes at this point that there is an unlimited supply of registers available. Register allocation is handled later.
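
As a sketch of the idea, here is a toy instruction selector in JavaScript. It walks a simplified expression tree and emits made-up, assembly-like instructions, handing out a fresh register for every value just as described above. The node types, mnemonics and register names are all illustrative.

let nextRegister = 0;
const freshRegister = () => `$r${nextRegister++}`;  // pretend registers are unlimited

function select(node, output) {
  switch (node.type) {
    case "constant": {
      const register = freshRegister();
      output.push(`MOV ${register}, ${node.value}`);
      return register;
    }
    case "add": {
      const left = select(node.left, output);
      const right = select(node.right, output);
      output.push(`ADD ${left}, ${right}`);  // result ends up in `left`
      return left;
    }
    default:
      throw new Error(`Don't know how to select instructions for ${node.type}`);
  }
}

// const output = [];
// select({ type: "add",
//          left:  { type: "constant", value: 1 },
//          right: { type: "constant", value: 2 } }, output);
// output => ["MOV $r0, 1", "MOV $r1, 2", "ADD $r0, $r1"]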

Once the code is generated, we have further opportunities for optimisation. The compiler can analyse the auto-generated machine code and identify inefficiencies just as it did in the middle end. The difference is that the back end knows the compilation target architecture and can apply architecture-specific optimisations. For example, to set a register to zero on x86 it is faster and shorter machine code to perform xor $reg, $reg (exclusive or with itself) than mov $reg, 0 (write zero into the register).

Register allocation

Program variables need to be held somewhere. Ideally, they will stay in the processor’s registers so that they can be accessed very quickly. So far we’ve pretended that the target has an infinite number of registers. This simplified the instruction selection task but, of course, in reality the architecture will have a limited number. The 64-bit x86 ISA has only 16 general-purpose registers! 64-bit ARM has a few more at 31. Either way, we’re constrained. The task of register allocation is to assign each active variable to a register, being mindful of data dependencies between instructions. When there are too many values to fit into the registers, the rest must “spill out” to main memory. Since main memory accesses are so much slower, this has a big impact on performance and is generally undesirable.

Interestingly, register allocation can be modelled as a variation of the graph colouring problem. Envisage a graph in which each node is an in-use variable. Edges go between nodes that interfere with one another, meaning that they must both be live and accessible at the same point in the program. Each register in the system is assigned a colour. Allocating registers then reduces to the problem of colouring each node (i.e. assigning a register) such that no two connected nodes have the same colour. Unfortunately for the poor compiler, this is an NP-complete problem and so an optimal solution is not feasible. The best it can hope to achieve is a reasonably good attempt in a reasonable time.
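
Here is a hedged sketch of the greedy flavour of that idea in JavaScript: given an interference graph and a list of register names, assign each variable the first register not already used by one of its neighbours, and mark it for spilling if none is free. Real allocators are far more sophisticated (spill cost heuristics, coalescing, live range splitting), so treat this purely as an illustration.

function allocateRegisters(interference, registers) {
  const assignment = {};
  for (const variable of Object.keys(interference)) {
    // Colours (registers) already taken by interfering variables.
    const taken = new Set(interference[variable].map(neighbour => assignment[neighbour]));
    const register = registers.find(r => !taken.has(r));
    // No free register: this variable would have to be spilled to memory.
    assignment[variable] = register || "SPILL";
  }
  return assignment;
}

// allocateRegisters({ a: ["b"], b: ["a", "c"], c: ["b"] }, ["rax", "rbx"]);
// => { a: "rax", b: "rbx", c: "rax" }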

Instruction scheduling

Pipeline performance is the primary concern of this final step. Recall from our discussion of computer architectures that modern processors use instruction pipelining to improve performance. Rather than work on just one instruction at a time, the processor works on multiple instructions at different stages of the execution cycle. While executing one instruction, it can also decode the next and read in the one after that. The performance of pipelined processors is very sensitive to blockages known as pipeline stalls. If an instruction in the decode stage of the pipeline needs to read a register written by an instruction in the execution stage, the pipeline must stall until the executing instruction writes its value into the register.

When faced with a branching instruction, the processor doesn’t know which branch to take until the instruction has fully executed. It can either wait and let the pipeline empty or, more likely, try to guess which branch will be taken and begin executing that branch before the branching instruction has even finished executing. A wrong guess means the whole pipeline has to be flushed and work restarted. The back end must detect such dependencies and arrange the instructions to avoid stalls in a way that doesn’t change the semantics of the code. This obviously requires the compiler to have a deep understanding of the target architecture.

Once the compiler has scheduled the machine code instructions, its work is done. It has taken a flat sequence of characters, reconstructed the program structure, analysed and optimised it and finally written it back out in the compilation target’s language. All that remains is for the linker to gather the object code from each translation unit into a single executable.

Who to trust?

I want to round off this chapter, and indeed the whole book, with a more philosophical point. One recurring idea in this book is that modern computing is a huge tower of abstraction. Look at all the mangling and wrangling your source code goes through to become an executable. And look at how many levels of abstraction there are below your program in the runtime environment, operating system and hardware. How do we actually know that the program is doing what we told it to do and nothing else?

In general, we can’t be sure about what a program is doing without looking at its binary code. We can use a program known as a disassembler to convert raw machine code back into assembly and examine that. Disassembled code is notoriously hard to read because all of the useful comments and names will have been removed or mangled by the original compiler. Assembly code concerns itself with the low level details and it is hard to reconstruct the overall intent. With time, dedication and skill, it’s definitely possible to reverse engineer a program and work out what it does, but it’s infeasible to do that routinely for anything but the most trivially small programs.

Reading source code is easier but we need to be sure that the compiler’s output matches the original source code’s semantics and behaviour. We could examine the compiler’s source code to check that it faithfully preserves behaviour. Then again, since all binary programs have been compiled it stands to reason that a compiler’s binary executable must itself have been compiled. How do we know that the compiler was compiled correctly?

The simple answer is that we need to trust. In Reflections on Trusting Trust (see further reading) Ken Thompson, one of the original Unix developers, described how he added a “hack” to the login command’s source code. The hack gave him backdoor access to any system using the bugged binary. Obviously, Thompson did this for demonstration purposes only! A security-conscious system administrator might try to prevent this class of attack by compiling login from source that they had carefully reviewed themselves. Thompson’s backdoor would be easily detected by a code review:

1 if (user_name == "KEN_THOMPSON") {
2   permitAccessAllAreas();
3 }

So he moved the code for the hack out of login’s source code and into the C compiler’s source code. After compiling a bugged C compiler to include the new backdoor, Thompson had a bugged compiler that would detect when it was compiling login and would inject the original hack into the output. The hack wouldn’t appear anywhere in the login source code.

Of course, our security-conscious administrator might think to review the compiler’s source code before compiling it. The hack would still be very obvious:

1 if (input_program == "login") {
2   append_code("if (user_name == \"KEN_THOMPSON\") {\
3                  permitAccessAllAreas();\
4                }");
5 }

So Thompson made his hack even more sophisticated. He changed the C compiler hack to detect when it was compiling a compiler and inject the compiler hack into the output compiler.

1 if (input_program == "c_compiler") {
2   append_code("if (input_program == \"login\") {\
3                  append_code(\"if (user_name == \\\"KEN_THOMPSON\\\") {\
4                    permitAccessAllAreas();\
5                  }\")\
6                }");
7 }

The clean compiler would output a bugged compiler that would output a bugged login. Thompson then removed the hack from the compiler’s source. If our crafty system administrator had discovered that the bugged compiler was mis-compiling login, scouring the compiler’s source code for hacks and recompiling a clean version wouldn’t work. The compiler would always reinsert its own hack. No amount of source code scrutiny would reveal that the compiler was sneaking in nefarious code.

Thompson’s conclusion was that you cannot trust code that you did not completely create yourself, all the way down to machine code. You have to trust the code’s creator. And even then, you can’t be sure that the processor is executing it as you intended. You have to trust the processor’s designers to design the processor to behave according to its published ISA. Someone else had to write an assembler in raw machine code. Then another person used that assembler to write a simple C compiler. And then yet another person used that compiler to compile a more complex compiler written in C. Another team of people produced the operating system that runs all these programs and (hopefully) faithfully and honestly mediates their access to the hardware. We must put our trust in the many people who put in so much work over decades to build modern computing.

Conclusion

Compilers are sophisticated and fascinating programs that convert programs from a high level form, normally source code, into a lower-level form ready for execution. Compilers may generate executable binaries for direct execution or intermediate bytecode intended to run in some kind of interpreter or virtual machine. They may do their work ahead of time or as the program is running. Compilers are just programs like any other, but they are a little bit special because their input and output is other programs.

Compilers are typically structured as pipelines. The front end handles converting source code into the compiler’s intermediate representation (IR). First it lexes the input character sequence into a sequence of meaningful tokens. These tokens are then parsed according to the rules of the language’s grammar to derive the structure of the program, from which the IR is generated. The optimiser refactors the IR into a more efficient but semantically equivalent form by analysing it for inefficiencies and rewriting them. Finally, the back end converts the IR into the compilation target’s language (object code or bytecode), taking care to schedule instructions and allocate registers as efficiently as it can. Each translation unit is compiled into object code that is combined by the linker into an executable binary. External libraries or runtime functionality may also be linked into the executable.

Further reading

The best way to understand compilers is to write one yourself. The final chapters of nand2tetris require you to write a virtual machine and a compiler that compiles a simple language into the VM’s bytecode. Both are challenging projects but immensely rewarding. Writing the recursive descent parser in particular is a fun and satisfying task. If that feels like a step too far for you, try starting by writing a simple calculator program. As you implement more complex functionality (nested expressions and so on), you will get exposure to how parsing works.

I have previously recommended Ruby Under a Microscope. It has a well-explained and illustrated overview of how Ruby tokenises and parses code, compiles it to bytecode and runs it in a virtual machine. Unfortunately, it describes an older version of Ruby’s implementation but this shouldn’t detract from its usefulness unless you particularly need to know about Ruby’s current implementation.

Thompson’s Turing Award lecture is known as Reflections on trusting trust. It’s very short and very influential.

I mentioned in the chapter that C++’s parsing is accidentally Turing complete. This comes about as new syntax rules get added to a language and have unintended consequences. This Stack Overflow answer demonstrates a program that is only syntactically correct if a number is prime.

Recently there have been a number of high quality compiler and interpreter books aimed at developers without computer science backgrounds. Crafting Interpreters by Bob Nystrom and Thorsten Ball’s Writing an Interpreter in Go and Writing a Compiler in Go are all highly recommended for a more in-depth but practically oriented treatment of compilers.

When it comes to compiler textbooks, you are spoiled for choice. The tricky thing is finding the most appropriate one for your needs. The canonical compiler “Bible” is Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, Ullman), also known as the “Dragon book” after its cover. However, it’s now considered rather dated. At the time, efficient parsing was a major challenge so it spends a lot of time on that and less time on the more modern challenges of optimisation. Engineering a Compiler is more up to date. It covers plenty of optimisations and is focused on the practicalities (should you ever wish to write your own optimising compiler). It’s a very substantial textbook and although it does start from the basics you might want to have worked through one of the introductory texts first.