Compilation From a 10,000 ft View

Your Program: hello.c

Let's build this thing. I believe in us.

#include <stdio.h>

int main(void) {
  printf("Hello world!");
  return 0;
}

What Now?


If you are a sensible human attempting to compile and run this piece of code, you may try a shell command like the following:

> clang hello.c -o hello.out
> ./hello.out
Hello world!

This seems reasonable, right? "man clang" tells us that clang is a compiler, and a compiler is just that thing that takes code and makes it runnable, right?

To tell the truth, clang is telling a little white lie. The thing we call "compiling" actually consists of at least four distinct steps:

  1. Preprocessing
  2. Compiling
  3. Assembling
  4. Linking

Preprocessing

You know those lines in C that start with "#"? The ones like "#ifdef ____", or "#include <...>"? Before we start translating the source code into something executable, these lines are processed. Each "#include" is replaced with the contents of the named header, any section guarded by an "#ifdef" on an undefined macro is removed, all macros are expanded, and comments are stripped.

This step can be run with the standalone C preprocessor (cpp) or through Clang's command line flags:

> clang -E hello.c -o hello.i

Compiling


This topic is too broad for one post, so I'll keep it brief (and hopefully write some follow-ups later to expand on it). The actual compilation is the most interesting part of this whole process: here, we translate preprocessed source code into human-readable assembly. Essentially, the source code is broken down into the basic instructions that a processor understands.

> clang -S hello.i -o hello.s

If we pop open "hello.s", we'll see some of this raw assembly (your assembly may differ):

  .section  __TEXT,__text,regular,pure_instructions
  .globl  _main
  .align  4, 0x90
_main:                                  ## @main
  .cfi_startproc
## BB#0:
  pushq %rbp
Ltmp2:
  .cfi_def_cfa_offset 16
Ltmp3:
  .cfi_offset %rbp, -16 
  movq  %rsp, %rbp
Ltmp4:
  .cfi_def_cfa_register %rbp
  subq  $16, %rsp
  leaq  L_.str(%rip), %rdi
  movl  $0, -4(%rbp)
  movb  $0, %al 
  callq _printf
  movl  $0, %ecx
  movl  %eax, -8(%rbp)          ## 4-byte Spill
  movl  %ecx, %eax
  addq  $16, %rsp
  popq  %rbp
  ret 
  .cfi_endproc

  .section  __TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
  .asciz  "Hello world!"


.subsections_via_symbols

Reading through this code, we can see that our "main" function sets up the stack, puts the address of our "Hello world!" string (L_.str) into a register (%rdi), sets up some other arguments, calls "printf", and then returns. This is what we expected!

Assembling


If we try telling our operating system to run this assembly file, it will not know what to do. That's weird: if we can read that cryptic assembly, why can't the computer? Actually, our machines can only understand something called "machine code" -- basically, it's a dense binary format of ones and zeroes that computers are great at reading, but that looks like nonsense to the untrained human eye.

Alright, fine. How do we get that machine code? By assembling, of course! The machine code originates directly from the assembly code we created earlier: a command like "movl" will be replaced with some number, like "0x1234". Well, disclaimer: movl's machine code instruction is probably not actually 0x1234, but it is a number, instead of a string. Assembling is, more or less, a one-to-one translation from string instructions to a denser format which encodes those instructions.

To assemble the ".s" file, we can use the standalone "as" assembler, or clang:

> clang -c hello.s -o hello.o

Linking


We have our object file (hello.o) -- we're so close to being able to run an actual executable file! But we can't run the "hello.o" file by itself: it has some unresolved symbols, and it is not formatted as an executable.

To print "Hello world!", we had to call "printf", which lives in the C standard library, not in hello.o. Including stdio.h only gave the compiler a declaration of "printf"; the machine code for it is somewhere else. So when we compiled "hello.o", "printf" was left as an unresolved symbol, meaning that clang was a lazy procrastinator and decided to figure it out later. In the assembly, we can see "callq _printf": what does this even mean to our processor? What's a "_printf" anyway? Truth is, the compiler didn't know. It just figured "Hey, <stdio.h> says this is defined somewhere else, so I'll trust that guy". When we get to the linking stage, the linker is responsible for resolving these symbols by finding them in other object files and libraries.

Additionally, when the linker combines all of these connected object files, it can find a "main" function (well, actually, a "_start" function, but more on that some other day), and connect all this machine code into an executable file (or shared library). This basically involves putting things in a standard format, like ELF on Linux or Mach-O on macOS (which is where the assembly above came from), so that a program loader will be able to run the code later.

To link the object file (or object files, if we are using many files of compiled code), we can just pass them directly to clang (or to gcc), though we could also invoke the "ld" linker ourselves.

> clang hello.o -o hello.out
> ./hello.out
Hello world!

So next time you compile your code, don't be fooled: a lot is happening behind the scenes! Let me know in the comments if you'd like me to clarify or expand on anything.
