Posts Tagged ‘CPU’

Performance Cost for Reducing the Number of Registers in X86-64

2017/05/08

When compiling a C program, the C compiler allocates CPU registers to program variables. For example, In the following compilation, the compiler assigns register rdi to variable a and register r15 to variable b.

int a = 1;
int b = 2;
b = a + b;
mov   $0x1,%rdi
mov   $0x2,%r15
add   %rdi,%r15

In the X86-64 architecture, there are 16 general purpose registers: rax, rbx, rcx, rdx, rbp, rsp, rsi, rdi, r8, r9, r10, r11, r12, r13, r14 and r15. When there are more variables than registers, the compiler has to use memory instead of registers. When a function calls another function, the caller has to save some of its registers to memory before the call and restore them from memory after the call. These two cases are called spilling, which is bad for performance because memory is much slower than register. Having more registers reduces spilling.

There are papers which modify the compiler and dedicate one or more registers for some specific purposes. For example, in order to track information flow, LIFT uses a dedicated register to store some information metadata. For another example, CPI hides a secret address in a dedicated register so that during arbitrary-memory-read attack, the attacker cannot read the address. Some SFI implementation uses a dedicated register to store jump target address, so that the original code cannot mess with it.

Besides spilling, another problem with having fewer registers is data dependencies. Modern CPUs can execute instructions in parallel or out of order as long as there are no dependencies between the instructions. In our previous example, the CPU can execute mov $0x1,%rdi and mov $0x2,%r15 in parallel. However, in the following example, the four instructions has to be executed in sequence because they all uses rax, thus every instruction depends on its previous one.

mov   $0x1,%rax      # rax = 1
lea   0x2(%rax),%r10 # r10 = rax + 2
mov   $0x3,%rax      # rax = 3
lea   0x4(%rax),%r11 # r11 = rax + 4

With more registers available, the compiler can generate faster code by renaming the rax register into another unused register in the last two instructions. After renaming, the first two and last two instruction can be executed in parallel. In modern CPUs, there are more physical registers than the number of named registers. During execution, a named register, say %rdi, is mapped to an underlying physical register. The Haswell microarchitectures has 168 physical registers. So, register renaming is performed by both compiler and CPU, i.e. there is some redundant work. In terms of data dependencies, maybe reducing the number of named registers isn’t so bad, because the CPU can still do renaming.

I think we have enough theoretical speculations. Let’s put it into some real tests. I have modified the LLVM/Clang compiler so that it uses fewer registers. In our evaluation, we have 4 versions of compilers:

  • C0: The original compiler: LLVM 4.0
  • C1: It doesn’t use r14.
  • C2: It doesn’t use r13, r14 or r15.
  • C3: It doesn’t use r11. The purpose of this is to see the difference between a callee-saved register (r14) and a caller-saved registers (r11).

Let’s first look at the difference of the binary produced by C0 and C1, by compiling the following C code using both compilers.

for (i = 0; i < 100000000; i++) {
    n0 = n1 + n4 * 8 + 46;
    n1 = n2 + n5 * 8 + 95;
    n2 = n3 + n6 * 1 + 55;
    n3 = n4 + n7 * 2 + 90;
    n4 = n5 + n8 * 1 + 58;
    n5 = n6 + n0 * 2 + 1 ;
    n6 = n7 + n1 * 4 + 59;
    n7 = n8 + n2 * 1 + 92;
    n8 = n0 + n3 * 4 + 64;
}

The following binary is produced by C0. Note that there is no memory operations at all.

  4004d0:       49 89 ff                mov    %rdi,%r15
  4004d3:       4c 89 f3                mov    %r14,%rbx
  4004d6:       49 8d 0c df             lea    (%r15,%rbx,8),%rcx
  4004da:       4d 8d 2c f0             lea    (%r8,%rsi,8),%r13
  4004de:       49 8d 7c f0 5f          lea    0x5f(%r8,%rsi,8),%rdi
  4004e3:       4d 8d 44 13 37          lea    0x37(%r11,%rdx,1),%r8
  4004e8:       4c 89 dd                mov    %r11,%rbp
  4004eb:       48 01 d5                add    %rdx,%rbp
  4004ee:       4e 8d 14 63             lea    (%rbx,%r12,2),%r10
  4004f2:       4e 8d 5c 63 5a          lea    0x5a(%rbx,%r12,2),%r11
  4004f7:       4e 8d 74 0e 3a          lea    0x3a(%rsi,%r9,1),%r14
  4004fc:       48 8d 74 4a 5d          lea    0x5d(%rdx,%rcx,2),%rsi
  400501:       4b 8d 94 ac b7 01 00    lea    0x1b7(%r12,%r13,4),%rdx
  400508:       00
  400509:       4d 8d a4 29 93 00 00    lea    0x93(%r9,%rbp,1),%r12
  400510:       00
  400511:       4e 8d 8c 91 d6 01 00    lea    0x1d6(%rcx,%r10,4),%r9
  400518:       00
  400519:       ff c8                   dec    %eax
  40051b:       75 b3                   jne    4004d0 <compute+0x40>

The following is produced by C1. Note that r14 is absent, because the compiler doesn’t know the existence of c14. There are two memory operations: one store operation at 4004e3, one load operation at 400516.

  4004d0:       49 89 ec                mov    %rbp,%r12
  4004d3:       4c 89 fb                mov    %r15,%rbx
  4004d6:       49 8d 0c dc             lea    (%r12,%rbx,8),%rcx
  4004da:       49 8d 2c f0             lea    (%r8,%rsi,8),%rbp
  4004de:       49 8d 7c f0 5f          lea    0x5f(%r8,%rsi,8),%rdi
  4004e3:       48 89 7c 24 f8          mov    %rdi,-0x8(%rsp)
  4004e8:       4d 8d 44 12 37          lea    0x37(%r10,%rdx,1),%r8
  4004ed:       4d 89 d3                mov    %r10,%r11
  4004f0:       49 01 d3                add    %rdx,%r11
  4004f3:       4a 8d 3c 6b             lea    (%rbx,%r13,2),%rdi
  4004f7:       4e 8d 54 6b 5a          lea    0x5a(%rbx,%r13,2),%r10
  4004fc:       4e 8d 7c 0e 3a          lea    0x3a(%rsi,%r9,1),%r15
  400501:       48 8d 74 4a 5d          lea    0x5d(%rdx,%rcx,2),%rsi
  400506:       49 8d 94 ad b7 01 00    lea    0x1b7(%r13,%rbp,4),%rdx
  40050d:       00
  40050e:       4f 8d ac 19 93 00 00    lea    0x93(%r9,%r11,1),%r13
  400515:       00
  400516:       48 8b 6c 24 f8          mov    -0x8(%rsp),%rbp
  40051b:       4c 8d 8c b9 d6 01 00    lea    0x1d6(%rcx,%rdi,4),%r9
  400522:       00
  400523:       ff c8                   dec    %eax
  400525:       75 a9                   jne    4004d0 <compute+0x40>

To measure the performance, we use SPEC CPU 2006 benchmark suit. Below is the result of the test cases.

benchresult

Let’s look at the result.

  • For most cases, C0 is the fastest, except sjeng and libquantum. Assuming bug-free compiler, there is no way to explain C0 being slower. My guess is experimental error since the variance for the two cases is quite large.
  • C2 has least registers available, and is the slowest for 6 out of 12 cases.
  • C1 is slower than C3 in most cases. This means a callee-saved register is more performance critical than a caller-saved register.
  • The overall overhead for C1 is XXX, C2 is XXX, C3 is XXX. (TODO)
Advertisements