Getting Familiar with LLVM IR

Tips:

Code snippets are shown in one of three ways throughout this environment:

Code that looks like this is a sample snippet, usually part of an explanation.
Code that appears in a box like the one below can be clicked, and it will automatically be typed into the appropriate terminal window:
```
vim readme.txt
```
Code appearing in windows like the one below is code that you should type in yourself. Usually there will be a unique ID or other bit you need to enter that we cannot supply. Items appearing in <> are the pieces you should substitute based on the instructions.
```
Add your name here - <name>
```

1. Overview

The LLVM IR (Intermediate Representation) is a Static Single Assignment (SSA) based representation that provides type safety, low-level operations, flexibility, and the capability of representing all supported high-level languages such as C/C++, Objective-C, and Fortran. It is the common code representation used throughout all phases of LLVM. LLVM IR is a low-level programming language similar to assembly. As a result, it is also called LLVM assembly language.

The LLVM IR can be used in three different forms: as an in-memory compiler IR, as an on-disk bitcode file, and as a human-readable assembly language text file. These three forms are equivalent. Tools are available to convert from one form to another.
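As a rough illustration of converting between the two on-disk forms, the sketch below builds command lines for llvm-as (text to bitcode) and llvm-dis (bitcode to text), the standard LLVM tools for this job. By convention, text files use the .ll extension and bitcode files use .bc. The helper name conversion_command and the file names are hypothetical, and the function only constructs a command; it does not run anything.

```python
from pathlib import Path


def conversion_command(path):
    """Return an argv list converting between the .ll and .bc forms.

    Hypothetical helper: .ll (text) -> .bc (bitcode) uses llvm-as,
    .bc -> .ll uses llvm-dis. Nothing is executed here.
    """
    src = Path(path)
    if src.suffix == ".ll":
        return ["llvm-as", str(src), "-o", str(src.with_suffix(".bc"))]
    if src.suffix == ".bc":
        return ["llvm-dis", str(src), "-o", str(src.with_suffix(".ll"))]
    raise ValueError(f"expected a .ll or .bc file, got {path!r}")


# example.ll and example.bc are placeholder file names.
print(" ".join(conversion_command("example.ll")))
print(" ".join(conversion_command("example.bc")))
```

The resulting list can be passed to subprocess.run on a machine where the LLVM tools are installed.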

The goal of this tutorial is to learn how to use clang to dump LLVM IR for a simple example program. We then guide readers through the LLVM Language Reference Manual to understand the text format of LLVM IR.

2. Test Clang and LLVM

Clang and LLVM have already been installed in the Docker-based online terminal in the right panel.

To test clang and opt, the LLVM optimizer, try the following commands:

clang --version

and

opt --version

Each command should print its version information.

3. Obtain the Example Input Code

git clone https://github.com/chunhualiao/LLVM-IR.git
cd LLVM-IR

This git repository contains:

a set of example C/C++ programs
a makefile that calls clang to generate LLVM IR dumps for the input programs
a set of example output LLVM IR files for the input programs
- For example, the clang-9.0.1 folder stores the sample .ll output files generated with clang 9.0.1

You can either look into the IR files directly or regenerate them yourself:

Typing make clean removes all the *.ll files.

Typing make all regenerates all the *.ll files.

4. Different Ways to Dump LLVM IR

The makefile generates .ll files for each input program in four different ways, so each input has four variants of its .ll file, v1 through v4.

The four ways are:

v1. clang -S -emit-llvm input.c # default
v2. clang -O3 -Xclang -disable-llvm-passes -S -emit-llvm input.c # turn on O3 but disable LLVM passes
v3. opt -S -mem2reg -instnamer input.v1.ll # using v1’s result as input
v4. opt -S -mem2reg -instnamer input.v2.ll # using v2’s result as input

As you can see, v1 and v2 are generated by clang, which is both the C frontend and the compiler driver of LLVM. v3 and v4 use opt, the LLVM optimizer, to transform the LLVM IR generated by clang so that the IR is cleaner and easier to understand.

The options used with clang are:

-S: Only run preprocess and compilation steps
-emit-llvm: Use the LLVM representation for assembler and object files

The options used with opt are:

-mem2reg: Move as many variables to registers as possible
-instnamer: Assign names to anonymous instructions

We recommend reading the v4 .ll files, since they are the easiest to understand.

To give it a try:

make clean && make func1.v4.c.ll

You will see the following output on the screen:

clang -O3 -Xclang -disable-llvm-passes -S -emit-llvm func1.c -o func1.v2.c.ll
opt -S -mem2reg -instnamer func1.v2.c.ll -o func1.v4.c.ll

The files func1.v2.c.ll and func1.v4.c.ll should now exist in the current directory.
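To make the relationship between the four variants concrete, here is a minimal sketch that builds the four command lines for a given input file. The function name dump_commands is hypothetical. The output file names follow the pattern visible in the make output above (func1.c becomes func1.v2.c.ll); the v1 and v3 names are assumed to follow the same pattern, which the actual makefile may or may not do.

```python
from pathlib import Path


def dump_commands(source):
    """Build the four command lines (v1 to v4) for one input file.

    Output names follow the pattern func1.c -> func1.v2.c.ll seen in
    the make output; the v1 and v3 names are an assumption.
    """
    src = Path(source)
    stem, ext = src.stem, src.suffix.lstrip(".")

    def out(version):
        return f"{stem}.{version}.{ext}.ll"

    return {
        # v1: default clang output.
        "v1": ["clang", "-S", "-emit-llvm", source, "-o", out("v1")],
        # v2: turn on O3 but disable the LLVM passes.
        "v2": ["clang", "-O3", "-Xclang", "-disable-llvm-passes",
               "-S", "-emit-llvm", source, "-o", out("v2")],
        # v3 and v4: run opt on the v1 and v2 results respectively.
        "v3": ["opt", "-S", "-mem2reg", "-instnamer", out("v1"), "-o", out("v3")],
        "v4": ["opt", "-S", "-mem2reg", "-instnamer", out("v2"), "-o", out("v4")],
    }


for name, argv in dump_commands("func1.c").items():
    print(name, " ".join(argv))
```

The v2 and v4 entries reproduce the two commands printed by make; v4 consumes the file that v2 writes, which is why make runs both.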

5. Simplest Function

Now let’s look at the text dump of LLVM IR for a very simple C function. Let’s open the input file first:

cat func1.c

You should see the following content:

int foo(int i)
{
   return i;
}

We now open the corresponding IR dump in the clang-9.0.1 folder to explain how to read the text dump of LLVM IR.

You can also open func1.v4.c.ll in your current directory. The content should be very similar.

cat clang-9.0.1/func1.v4.c.ll

You should see the following content:

; ModuleID = 'func1.v2.c.ll'
source_filename = "func1.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

; Function Attrs: nounwind ssp uwtable
define i32 @foo(i32 %arg) #0 {
bb:
 ret i32 %arg
}

attributes #0 = { nounwind ssp uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{!"clang version 9.0.1 "}

6. The High-Level Structure of the IR

A module in LLVM corresponds to a translation unit of the input program. Each module consists of:

global variables (if present),
functions (if present), each containing
- basic blocks, which in turn contain
  - instructions,
and symbol table entries (if present).

From the IR dump of the simplest C function shown above, we can see that the corresponding LLVM IR has the following sections:

Lines 1 through 4: information about the current module, such as its ID, source file name, data layout, and target triple.
Lines 6 through 10: the definition of the function foo().
Line 12: the attributes of the function.
Lines 14 through 19: named metadata and the metadata nodes it refers to.
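The line ranges listed above can be recovered mechanically from the leading token of each line. The following is a minimal sketch, not a real IR parser: the names label_lines and section_ranges are hypothetical, the attributes line in the sample is shortened for readability, and any comment other than "; Function Attrs" is simply counted as module information.

```python
SAMPLE_IR = '''; ModuleID = 'func1.v2.c.ll'
source_filename = "func1.c"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

; Function Attrs: nounwind ssp uwtable
define i32 @foo(i32 %arg) #0 {
bb:
 ret i32 %arg
}

attributes #0 = { nounwind ssp uwtable }

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{!"clang version 9.0.1 "}
'''


def label_lines(ir_text):
    """Return (line_number, label) pairs for the non-blank lines of an IR dump."""
    labels = []
    in_function = False
    for number, line in enumerate(ir_text.splitlines(), start=1):
        text = line.strip()
        if not text:
            continue
        if in_function:
            label = "function"
            in_function = text != "}"  # the closing brace ends the body
        elif text.startswith("; Function Attrs") or text.startswith("define "):
            label = "function"
            in_function = text.startswith("define ") and not text.endswith("}")
        elif text.startswith((";", "source_filename", "target ")):
            label = "module info"
        elif text.startswith("attributes "):
            label = "attributes"
        elif text.startswith("!"):
            label = "metadata"
        else:
            label = "other"
        labels.append((number, label))
    return labels


def section_ranges(labels):
    """Collapse consecutive lines with the same label into (label, first, last)."""
    ranges = []
    for number, label in labels:
        if ranges and ranges[-1][0] == label:
            ranges[-1] = (label, ranges[-1][1], number)
        else:
            ranges.append((label, number, number))
    return ranges


for label, first, last in section_ranges(label_lines(SAMPLE_IR)):
    print(f"lines {first}-{last}: {label}")
```

Blank lines are skipped, so the two groups of metadata lines merge into a single range.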

Note also that comments in the IR start with ‘;’.

Now let’s go through each section one by one and explain what each line means.

7. Target Information

target datalayout = “layout specification” specifies how data is to be laid out in memory.

LangRef.html#data-layout gives the full description of the layout specification. Essentially, the layout specification string (“e-m:o-i64:64-f80:128-n8:16:32:64-S128”) is a list of specifications separated by minus signs. We can break the string down into individual pieces and interpret each piece based on the language reference text.

e - specifies that the target lays out data in little-endian form.
m:o - specifies that LLVM names are mangled in the output, using the format m:<mangling>; the style option o selects Mach-O mangling.
i64:64 - specifies the alignment for an integer type of a given bit <size>, using the format i<size>:<abi>:<pref>. Here, 64-bit integers have a 64-bit ABI alignment.
f80:128 - specifies the alignment for a floating-point type of a given bit <size>, using the format f<size>:<abi>:<pref>. Here, 80-bit floating-point values have a 128-bit ABI alignment.
n8:16:32:64 - specifies a set of native integer widths for the target CPU in bits, using the format n<size1>:<size2>:<size3>... Here, the native widths are 8, 16, 32, and 64 bits.
S128 - specifies the natural alignment of the stack in bits, using the format S<size>. Here, the stack is aligned to 128 bits.
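The breakdown above can be expressed as a short parsing routine. This is a minimal sketch that covers only the component kinds appearing in this tutorial’s example string; the function name parse_datalayout and the dictionary keys are hypothetical, and the preferred alignment is assumed to default to the ABI alignment when it is omitted.

```python
def parse_datalayout(spec):
    """Split a data layout string into a dictionary of its components.

    Only handles the kinds used in this tutorial (e, E, m, i, f, n, S);
    anything else is collected under "other".
    """
    layout = {"integers": {}, "floats": {}, "other": []}
    for part in spec.split("-"):
        if not part:
            continue
        if part in ("e", "E"):
            layout["endianness"] = "little" if part == "e" else "big"
        elif part.startswith("m:"):
            layout["mangling"] = part[2:]
        elif part[0] in "if" and part[1:2].isdigit():
            size, abi, *pref = part[1:].split(":")
            kind = "integers" if part[0] == "i" else "floats"
            # Alignments are in bits; pref is assumed to default to abi.
            layout[kind][int(size)] = {
                "abi": int(abi),
                "pref": int(pref[0]) if pref else int(abi),
            }
        elif part[0] == "n" and part[1:2].isdigit():
            layout["native_widths"] = [int(w) for w in part[1:].split(":")]
        elif part[0] == "S" and part[1:].isdigit():
            layout["stack_alignment"] = int(part[1:])
        else:
            layout["other"].append(part)
    return layout


layout = parse_datalayout("e-m:o-i64:64-f80:128-n8:16:32:64-S128")
for key, value in layout.items():
    print(key, value)
```

A production tool would query LLVM’s own DataLayout class rather than re-parsing the string; the sketch is only meant to show how the pieces map onto the format descriptions.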

LangRef.html#target-triple explains how to interpret the triple “x86_64-apple-macosx10.15.0”. It has the following canonical form: ARCHITECTURE-VENDOR-OPERATING_SYSTEM. Here, the architecture is x86_64, the vendor is apple, and the operating system is macosx10.15.0.
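Splitting a triple into its components takes only a few lines. The sketch below uses a hypothetical function name, parse_triple, and also accepts the four-part canonical form ARCHITECTURE-VENDOR-OPERATING_SYSTEM-ENVIRONMENT that the language reference describes alongside the three-part one.

```python
def parse_triple(triple):
    """Split a target triple into its named components.

    Handles the two canonical forms: ARCHITECTURE-VENDOR-OPERATING_SYSTEM
    and ARCHITECTURE-VENDOR-OPERATING_SYSTEM-ENVIRONMENT.
    """
    parts = triple.split("-", 3)
    if len(parts) < 3:
        raise ValueError(f"not a canonical triple: {triple!r}")
    names = ("architecture", "vendor", "operating_system", "environment")
    # zip stops at the shorter sequence, so "environment" only appears
    # when the triple has a fourth component.
    return dict(zip(names, parts))


print(parse_triple("x86_64-apple-macosx10.15.0"))
```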

As you can see, this LLVM IR is machine-dependent, since it contains a machine-specific data layout and target triple.

8. Function

The LLVM IR for the function is quite straightforward:

; Function Attrs: nounwind ssp uwtable
define i32 @foo(i32 %arg) #0 {
bb:
 ret i32 %arg
}

Line 6 starts with ‘;’, so it is a comment line. It lists the function’s attributes.

Lines 7 through 10 represent the function, with the following content:

starting with the keyword “define”
return type “i32”, a 32-bit integer.
@foo: global identifiers (functions and global variables) start with ‘@’ in LLVM IR. So this is a globally scoped function foo().
parameter list: a list of type-variable pairs. In this example, there is a single parameter %arg of type i32. Local identifiers start with ‘%’ in LLVM IR.
Basic block: bb: is the label indicating the start of a basic block.
Instruction: the ret instruction returns control to the caller, here with one operand, %arg, of type i32.
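The pieces of the define line listed above can be picked apart with a regular expression. This is a simplified sketch under stated assumptions: the name parse_define and the result keys are hypothetical, and the pattern handles only lines shaped like the one in this tutorial, not linkage types, calling conventions, parameter attributes, or quoted names that real IR may contain.

```python
import re

DEFINE_RE = re.compile(
    r"define\s+(?P<ret>\S+)\s+@(?P<name>[\w.$-]+)\((?P<params>[^)]*)\)"
    r"(?:\s+#(?P<attrs>\d+))?\s*\{"
)


def parse_define(line):
    """Pick apart a simple 'define' line into its components."""
    match = DEFINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported define line: {line!r}")
    params = []
    for item in filter(None, (p.strip() for p in match["params"].split(","))):
        # Each parameter is a "type %name" pair.
        type_name, _, var = item.partition(" ")
        params.append((type_name, var))
    return {
        "return_type": match["ret"],
        "name": match["name"],
        "params": params,
        "attribute_group": match["attrs"],  # the N in "#N", or None
    }


print(parse_define("define i32 @foo(i32 %arg) #0 {"))
```

The attribute_group entry is the number after ‘#’, which refers to the attributes #0 line later in the file.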

To understand the full syntax and semantics of functions, basic blocks, and instructions, you can read the corresponding sections of the language reference webpage.

9. Function Attributes

LLVM uses a set of attributes to communicate additional information about a function. These attributes can help LLVM better understand the semantics of functions and generate more efficient code.

LangRef.html#function-attributes gives a full list of attributes. Here we elaborate only on the few that appear at line 12:

nounwind: This function attribute indicates that the function never raises an exception.
ssp: This attribute indicates that the function should emit a stack smashing protector.
uwtable: This attribute indicates that the ABI being targeted requires that an unwind table entry be produced for this function even if we can show that no exceptions pass through it.
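Line 12 mixes two kinds of attributes: bare keywords such as nounwind, and quoted string attributes such as "target-cpu"="penryn". The sketch below separates the two. The function name parse_attribute_group is hypothetical, the sample line is a shortened version of line 12, and the regular expressions are only meant for lines of this shape.

```python
import re


def parse_attribute_group(line):
    """Split an 'attributes #N = { ... }' line into keywords and string attributes."""
    match = re.match(r'attributes\s+#(\d+)\s*=\s*\{(.*)\}\s*$', line.strip())
    if match is None:
        raise ValueError(f"not an attribute group: {line!r}")
    group_id, body = int(match.group(1)), match.group(2)
    keywords, strings = [], {}
    # Each token is "key"="value", a bare "key", or an unquoted keyword.
    for key, value, keyword in re.findall(r'"([^"]*)"(?:="([^"]*)")?|(\S+)', body):
        if keyword:
            keywords.append(keyword)
        else:
            strings[key] = value  # empty string when no value is given
    return group_id, keywords, strings


# Shortened version of line 12 of the dump.
SAMPLE = ('attributes #0 = { nounwind ssp uwtable "no-frame-pointer-elim"="true" '
          '"no-frame-pointer-elim-non-leaf" "target-cpu"="penryn" }')

group_id, keywords, strings = parse_attribute_group(SAMPLE)
print(group_id, keywords)
print(strings)
```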

10. Named Metadata

The last portion of the IR is a collection of named metadata:

!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{!"clang version 9.0.1 "}

The name “llvm.ident” is short for LLVM identifier. Its value is !2, which in turn is “clang version 9.0.1”. So this named metadata shows the version of clang used to generate this IR file.

The llvm.module.flags metadata contains a list of metadata triplets to communicate information about the module as a whole. Each triplet has three elements:

behavior flag: specifies the behavior when two (or more) modules are merged together and the merge encounters two (or more) metadata entries with the same ID. The value can be 1 through 7, indicating different behaviors such as emitting an error, emitting a warning, overriding with a specified value, appending the two values, etc.
metadata string: a unique ID for the metadata, serving as its name.
value of the flag: the value for this metadata. It is compared against another module’s metadata value to trigger the specified behavior.

For example,

metadata !0 has the ID of “wchar_size” and the value of 4 (as shown at line 17). The behavior flag value is 1 (emitting an error if the two values disagree).
metadata !1 has the ID of “PIC Level” and the value of 2 (as shown at line 18). The behavior flag value is 7 (taking the max of the two values).
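The triplets can also be decoded programmatically. The following is a minimal sketch that handles only simple i32-valued flags like the two above. The behavior names for values 1 through 7 are the ones listed in the LLVM Language Reference Manual; the function name decode_module_flags and the regular expression are hypothetical conveniences.

```python
import re

# Behavior names for flag values 1 through 7.
BEHAVIORS = {
    1: "Error",
    2: "Warning",
    3: "Require",
    4: "Override",
    5: "Append",
    6: "AppendUnique",
    7: "Max",
}

FLAG_RE = re.compile(r'!(\d+)\s*=\s*!\{i32 (\d+), !"([^"]+)", i32 (-?\d+)\}')


def decode_module_flags(ir_text):
    """Return {flag name: (behavior name, value)} for simple i32-valued flags."""
    flags = {}
    for _node, behavior, name, value in FLAG_RE.findall(ir_text):
        flags[name] = (BEHAVIORS.get(int(behavior), "unknown"), int(value))
    return flags


METADATA = '''!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 7, !"PIC Level", i32 2}
!2 = !{!"clang version 9.0.1 "}
'''

for name, (behavior, value) in decode_module_flags(METADATA).items():
    print(f"{name}: behavior={behavior}, value={value}")
```

The !2 node is skipped because it is not a triplet; it belongs to llvm.ident rather than llvm.module.flags.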

11. References

The following links are useful for further information:

LLVM Language Reference Manual: This is the LLVM IR specification, also known as the LLVM assembly language reference. You can find the syntax, semantics, and examples of all LLVM IR constructs.
Introduction to LLVM: YouTube tutorial presented by E. Christopher & J. Doerfert at the 2019 LLVM Developers’ Meeting. The command-line options used in this tutorial come from that talk.
