Thread Local Storage on macOS

0001-01-01

How are thread local variables implemented on macOS? Through some debugging, Googling and reading the source, let's figure it out.

A thread local variable is declared and accessed like an ordinary variable, but each thread has its own copy of it. When a thread modifies the variable, the modification is visible only to that thread and not to the others (consequently, different threads may see different values, depending on if and when each of them modified its own copy).

Let's take an example. The variable i is thread local (__thread is a GNU extension to the C language). We create 3 threads, each of which increments i by 1. If i were a normal variable and all three increments took effect, i would end up as 13. But since i is thread local, each thread gets its own copy of i, initialized to 10, and increments that copy by 1, so every thread sees 11. We can verify this behaviour.

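#include <stdio.h>
#include <pthread.h>

__thread int i = 10;   /* every thread gets its own copy, initialized to 10 */
void increment_i(void);

int main(void) {
  pthread_t t1, t2, t3;

  /* pthread_create expects a void *(*)(void *) start routine. increment_i is
     kept as void(void) so that it matches the disassembly below, hence the
     explicit cast; this works on arm64 macOS but is not strictly portable C. */
  pthread_create(&t1, NULL, (void *(*)(void *))increment_i, NULL);
  pthread_create(&t2, NULL, (void *(*)(void *))increment_i, NULL);
  pthread_create(&t3, NULL, (void *(*)(void *))increment_i, NULL);

  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  pthread_join(t3, NULL);

  return 0;
}

/* Adds 1 to the calling thread's copy of i and prints the result. */
void increment_i(void) {
  int k = i + 1;
  i = k;
  printf("value of k is %d\n", k);
}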

Compiling it with gcc tlocal.c -o tlocal -lpthread

and executing it

./tlocal
value of k is 11
value of k is 11
value of k is 11

So how are thread local variables implemented?

To understand this, we need to set a breakpoint in increment_i and observe what happens. Let's load our executable in lldb and step through the code.

>lldb --file tlocal

(lldb) b increment_i
Breakpoint 1: where = tlocal`increment_i, address = 0x0000000100003f64
(lldb) run
Process 95169 launched: '/Users/govind/gowind-whisper/tlocal' (arm64)
Process 95169 stopped
* thread #2, stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f64 tlocal`increment_i
tlocal`increment_i:
->  0x100003f64 <+0>:  sub    sp, sp, #0x20
    0x100003f68 <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f6c <+8>:  add    x29, sp, #0x10
    0x100003f70 <+12>: adrp   x0, 5
  thread #3, stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f64 tlocal`increment_i
tlocal`increment_i:
->  0x100003f64 <+0>:  sub    sp, sp, #0x20
    0x100003f68 <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f6c <+8>:  add    x29, sp, #0x10
    0x100003f70 <+12>: adrp   x0, 5
  thread #4, stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f64 tlocal`increment_i
tlocal`increment_i:
->  0x100003f64 <+0>:  sub    sp, sp, #0x20
    0x100003f68 <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f6c <+8>:  add    x29, sp, #0x10
    0x100003f70 <+12>: adrp   x0, 5
Target 0: (tlocal) stopped.
(lldb)

3 threads are launched and all of them stop at the beginning of increment_i. Let's select one thread and step through it, to make things less confusing.

thread select 4
* thread #4, stop reason = breakpoint 1.1
    frame #0: 0x0000000100003f64 tlocal`increment_i
tlocal`increment_i:
->  0x100003f64 <+0>:  sub    sp, sp, #0x20
    0x100003f68 <+4>:  stp    x29, x30, [sp, #0x10]
    0x100003f6c <+8>:  add    x29, sp, #0x10
    0x100003f70 <+12>: adrp   x0, 5

Now, what we are interested in is the adrp instruction. As covered in the previous post, we use the GOT (global offset table) to load the address of a global variable. The address is stored at a fixed offset from the base of the GOT. adrp sets x0 to the address of a 4 KiB page, computed relative to the current instruction; together with the offset supplied by the instruction that follows it, this gives us the GOT entry (GOT + some value) where we will find the address of our global variable. We then load the actual value of our variable from this address (using ldr x0, [x0]).

In this case, we can’t use a GOT. Why?

All threads share the same address space. If the variable i is stored at address x, then all threads will see the same address x when they try to load the value of i. In increment_i, we aren’t using a thread index or any other thread specific identifier (at least not directly) to compute a different address in each thread, so how does each thread end up with its own copy of i?

The answer lies in how thread local storage works. Thread local data (variables) are described by sections called __thread_vars and __thread_bss (the rough counterparts of .tdata and .tbss on Linux systems). You can see them by dumping the sections present in the image (the executable file) in lldb.

(lldb) image dump sections tlocal
Sections for '/Users/govind/gowind-whisper/tlocal' (arm64):
  SectID     Type             Load Address                             Perm File Off.  File Size  Flags      Section Name
  ---------- ---------------- ---------------------------------------  ---- ---------- ---------- ---------- ----------------------------
  0x00000100 container        [0x0000000000000000-0x0000000100000000)* ---  0x00000000 0x00000000 0x00000000 tlocal.__PAGEZERO
  0x00000200 container        [0x0000000100000000-0x0000000100004000)  r-x  0x00000000 0x00004000 0x00000000 tlocal.__TEXT
....
  0x00000005 regular          [0x0000000100008000-0x0000000100008018)  rw-  0x00008000 0x00000018 0x00000013 tlocal.__DATA.__thread_vars
  0x00000006 regular          [0x0000000100008018-0x000000010000801c)  rw-  0x00000000 0x00000000 0x00000012 tlocal.__DATA.__thread_bss
  0x00000500 container        [0x000000010000c000-0x0000000100010000)  r--  0x0000c000 0x00000403 0x00000000 tlocal.__LINKEDIT

Our thread local variables seem to live somewhere in the range 0x0000000100008000-0x000000010000801c. Let us see what address we actually get for i when we read its value in each thread.

(lldb) stepi -c 3
...
* thread #4, stop reason = instruction step into
    frame #0: 0x0000000100003f70 tlocal`increment_i + 12
tlocal`increment_i:
->  0x100003f70 <+12>: adrp   x0, 5
    0x100003f74 <+16>: add    x0, x0, #0x0              ; i
    0x100003f78 <+20>: ldr    x8, [x0]
    0x100003f7c <+24>: blr    x8
...
* thread #4, stop reason = instruction step into
    frame #0: 0x0000000100003f74 tlocal`increment_i + 16
tlocal`increment_i:
->  0x100003f74 <+16>: add    x0, x0, #0x0
    0x100003f78 <+20>: ldr    x8, [x0]
    0x100003f7c <+24>: blr    x8
    0x100003f80 <+28>: ldr    w8, [x0]
Target 0: (tlocal) stopped.
(lldb) register read x0
      x0 = 0x0000000100008000  tlocal`i
(lldb) stepi
...
* thread #4, stop reason = instruction step into
    frame #0: 0x0000000100003f78 tlocal`increment_i + 20
tlocal`increment_i:
->  0x100003f78 <+20>: ldr    x8, [x0]
    0x100003f7c <+24>: blr    x8
    0x100003f80 <+28>: ldr    w8, [x0]
    0x100003f84 <+32>: add    w8, w8, #0x1
Target 0: (tlocal) stopped.
(lldb) register read x0
      x0 = 0x0000000100008000  tlocal`i
(lldb) stepi
* thread #4, stop reason = instruction step into
    frame #0: 0x0000000100003f7c tlocal`increment_i + 24
tlocal`increment_i:
->  0x100003f7c <+24>: blr    x8
    0x100003f80 <+28>: ldr    w8, [x0]
    0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
Target 0: (tlocal) stopped.
(lldb) register read x8
      x8 = 0x000000018c84d120  libdyld.dylib`tlv_get_addr
(lldb)

The adrp and add instructions leave the value 0x0000000100008000 in x0, which is the start of the __thread_vars section. Reading the value stored at this address, we see that it is not the value of i proper but another address, which points to a function, tlv_get_addr.

tlv_get_addr seems to be a macOS specific function that ships with the dynamic linker (dyld), in libdyld.dylib. This library is mapped into the address space of the process, so that the process can, in situations like these, call functions that dyld provides. We can check this using image lookup.

(lldb) image lookup -r -n tlv_get_addr #lookup a fn using a regex value tlv_get_addr
1 match found in /usr/lib/system/libdyld.dylib:
        Address: libdyld.dylib[0x00000001803d5120] (libdyld.dylib.__TEXT.__text + 3088)
        Summary: libdyld.dylib`tlv_get_addr

Stepping over this call to tlv_get_addr, we see that the function returns a value in x0. THIS seems to be the actual address of our variable i, and the next instructions load the value of i from it and add 1 (k = i + 1 in our code).

(lldb) thread step-over
Process 95169 stopped
* thread #4, stop reason = instruction step over
    frame #0: 0x0000000100003f80 tlocal`increment_i + 28
tlocal`increment_i:
->  0x100003f80 <+28>: ldr    w8, [x0]
    0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
    0x100003f8c <+40>: ldur   w8, [x29, #-0x4]
Target 0: (tlocal) stopped.
(lldb) register read x0
      x0 = 0x000060000000c000
(lldb) stepi
Process 95169 stopped
* thread #4, stop reason = instruction step into
    frame #0: 0x0000000100003f84 tlocal`increment_i + 32
tlocal`increment_i:
->  0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
    0x100003f8c <+40>: ldur   w8, [x29, #-0x4]
    0x100003f90 <+44>: str    w8, [x0]
Target 0: (tlocal) stopped.
(lldb) register read w8
      w8 = 0x0000000a

Switching to a different thread, we can see that after the blr x8 call x0 holds a different address, which for thread #3 is the address of its copy of the variable i. As in thread #4, the value loaded into w8 is 0xa (10), the initial value of i.

(lldb) thread select 3
* thread #3
    frame #0: 0x0000000100003f80 tlocal`increment_i + 28
tlocal`increment_i:
->  0x100003f80 <+28>: ldr    w8, [x0]
    0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
    0x100003f8c <+40>: ldur   w8, [x29, #-0x4]
(lldb) register read x0
      x0 = 0x0000600000010000
(lldb) stepi
Process 95169 stopped
* thread #3, stop reason = instruction step into
    frame #0: 0x0000000100003f84 tlocal`increment_i + 32
tlocal`increment_i:
->  0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
    0x100003f8c <+40>: ldur   w8, [x29, #-0x4]
    0x100003f90 <+44>: str    w8, [x0]
  thread #4, stop reason = trace
    frame #0: 0x0000000100003f84 tlocal`increment_i + 32
tlocal`increment_i:
->  0x100003f84 <+32>: add    w8, w8, #0x1
    0x100003f88 <+36>: stur   w8, [x29, #-0x4]
    0x100003f8c <+40>: ldur   w8, [x29, #-0x4]
    0x100003f90 <+44>: str    w8, [x0]
Target 0: (tlocal) stopped.
(lldb) register read w8
      w8 = 0x0000000a

Thread local variables are thus made available to userspace programs on macOS through one simple indirection. Linux provides a similar call, __tls_get_addr. You can find more documentation on how TLS works here.
