thb

Reputation: 14424

thread_local at block scope

What is the use of a thread_local variable at block scope?

If a compilable sample helps to illustrate the question, here it is:

#include <thread>
#include <iostream>

namespace My {
    void f(int *const p) {++*p;}
}

int main()
{
    thread_local int n {42};
    std::thread t(My::f, &n);
    t.join();
    std::cout << n << "\n";
    return 0;
}

Output: 43

In the sample, the new thread gets its own n but (as far as I know) can do nothing interesting with it, so why bother? Does the new thread's own n have any use? And if it has no use, then what is the point?

Naturally, I assume that there is a point. I just do not know what the point might be. This is why I ask.

If the new thread's own n needs (as I suppose) special handling at runtime, perhaps because the machine code cannot reach it in the usual way through a precalculated offset from the base pointer of the new thread's stack, then are we not merely wasting machine cycles and electricity for no gain? And even if no special handling were required, I still see no gain.

So why thread_local at block scope, please?

Upvotes: 8

Views: 2458

Answers (4)

Lewis Kelsey

Reputation: 4677

At block scope, static thread_local and thread_local are equivalent. thread_local gives the variable thread storage duration, which is neither static nor automatic, so writing static next to it does not change the storage duration; when no other storage-class specifier is present, static is simply implied. static does not change the linkage at block scope either (a block-scope variable always has no linkage), so the keyword has no remaining effect there. extern thread_local is also possible at block scope.

At file scope the specifiers do matter. static thread_local gives the variable internal linkage, which means there will be one copy per translation unit in the TLS (each translation unit will resolve to its own variable at the TLS index for the .exe, because the assembler will insert the variable in the rdata$t section of the .o file and mark it in the symbol table as a local symbol due to the lack of the .global directive on the symbol). extern thread_local at file scope is legal, as it is at block scope, and refers to the thread_local variable defined in another translation unit. A plain thread_local at file scope is not implicitly static, because it can provide a global symbol definition for another translation unit, which a block-scope variable cannot do.
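
To make the block-scope equivalence concrete, here is a minimal sketch. The function names (bump_plain, bump_static, compare_spellings) are made up for illustration; the asserts record the behaviour I would expect from a conforming compiler, with the caller-side check assuming both counters have been called equally often on the calling thread.

```cpp
#include <cassert>
#include <thread>

// `static` is implied here: the counter has thread storage duration.
int bump_plain() {
    thread_local int calls = 0;
    return ++calls;
}

// Spelling out `static` changes nothing.
int bump_static() {
    static thread_local int calls = 0;
    return ++calls;
}

// Hypothetical driver: exercises both counters on a fresh thread and on the
// calling thread, then checks that the two spellings behave identically.
void compare_spellings() {
    int worker_plain = 0;
    int worker_static = 0;
    std::thread worker([&] {
        bump_plain();
        bump_static();
        worker_plain = bump_plain();     // second call on the worker thread
        worker_static = bump_static();
    });
    worker.join();

    const int caller_plain = bump_plain();
    const int caller_static = bump_static();

    // Each counter persisted across the worker's two calls...
    assert(worker_plain == 2 && worker_static == 2);
    // ...and the caller's copies advanced in lockstep, untouched by the worker.
    assert(caller_plain == caller_static);
}
```

Calling compare_spellings() from main is enough to try it; both functions should compile to the same kind of TLS access.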

For ELF, the compiler will store all initialised thread_local variables (including block-scope ones) in .tdata and uninitialised ones in .tbss; for the PE format, they all go in .tls. I presume the thread library, when creating a thread, will access the .tls segment and perform Windows API calls (TlsAlloc and TlsSetValue), which allocate the variables for each .exe and .dll on the heap, place a pointer in the TLS array of the thread's TEB in the GS segment and return the index allocated, and that it will also call the DLL_THREAD_ATTACH routines of dynamic libraries. Presumably, a pointer to a value in the space defined by _tls_start and _tls_end is what's passed to TlsSetValue as the value pointer.

The difference between file scope static/extern thread_local and block scope (extern) thread_local is the same general difference as between file scope static/extern and block scope static/extern: the name of a block-scope thread_local variable goes out of scope at the end of the block it is declared in, although the object can still be returned and accessed by address because of its thread storage duration.
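
A short sketch of that last point, assuming a hypothetical accessor per_thread_slot: the name slot is invisible outside the function, yet the reference it hands out stays valid for as long as the calling thread lives, and each thread gets a different object.

```cpp
#include <cassert>
#include <thread>

// The name `slot` is local to this function, but the object it names lives
// until the calling thread exits, so returning a reference is fine.
int& per_thread_slot() {
    thread_local int slot = 0;
    return slot;
}

// Hypothetical check, for illustration only.
void check_slots() {
    int& mine = per_thread_slot();
    mine = 7;                                // write through the reference

    bool distinct = false;
    int worker_initial = -1;
    std::thread worker([&] {
        int& theirs = per_thread_slot();     // the worker's own object
        distinct = (&theirs != &mine);       // compared while both are alive
        worker_initial = theirs;             // freshly initialised
        theirs = 99;                         // does not affect `mine`
    });
    worker.join();

    const bool stable = (&per_thread_slot() == &mine);
    assert(distinct);
    assert(worker_initial == 0);
    assert(stable);                          // same object on every call
    assert(mine == 7);
}
```

The one thing to avoid is keeping such a reference after the owning thread has exited, since the object is destroyed with the thread.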

The compiler knows the index of the data in the .tls segment, so it can emit accesses through the GS segment directly, as can be seen on godbolt.

MSVC

thread_local int a = 5;

int square(int num) {
    thread_local int i = 5;
    return a * i;
}

_TLS    SEGMENT
int a DD        05H                           ; a
_TLS    ENDS
_TLS    SEGMENT
int `int square(int)'::`2'::i DD 05H                        ; `square'::`2'::i
_TLS    ENDS

num$ = 8
int square(int) PROC                                    ; square
        mov     DWORD PTR [rsp+8], ecx
        mov     eax, OFFSET FLAT:int a      ; a
        mov     eax, eax
        mov     ecx, DWORD PTR _tls_index
        mov     rdx, QWORD PTR gs:88
        mov     rcx, QWORD PTR [rdx+rcx*8]
        mov     edx, OFFSET FLAT:int `int square(int)'::`2'::i
        mov     edx, edx
        mov     r8d, DWORD PTR _tls_index
        mov     r9, QWORD PTR gs:88
        mov     r8, QWORD PTR [r9+r8*8]
        mov     eax, DWORD PTR [rcx+rax]
        imul    eax, DWORD PTR [r8+rdx]
        ret     0
int square(int) ENDP                                    ; square

This loads a 64-bit pointer from gs:88 (gs:[0x58], which holds the linear address of the thread-local storage array), then loads a 64-bit pointer from the TLS array pointer + _tls_index*8 (the index into the array scaled by the pointer size). The variable a is then loaded from this pointer plus its offset into the .tls segment. Both variables use the same _tls_index, which suggests that there is one index per .exe, i.e. per .tls section; indeed there is one _tls_index per TLS directory in .rdata, and the variables are packed together at the address pointed to by the TLS array entry. static thread_local variables in different translation units will be merged into .tls and all be packed together at the same index.
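
The two-step lookup can be imitated in portable C++. This is only a toy model of the address arithmetic: ToyThread, load_tls_int, store_tls_int and toy_square are invented names, and the layout (a at offset 0, i at offset sizeof(int)) is an assumption, not what MSVC actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: these types are invented and are not part of Windows or the CRT.
struct ToyThread {
    // Stands in for the array reached through gs:[0x58]: one pointer per
    // module, each pointing at this thread's private copy of the module's TLS block.
    std::vector<unsigned char*> tls_array;
};

// Mirrors `mov rcx, [rdx+rcx*8]` followed by `mov eax, [rcx+rax]`.
int load_tls_int(const ToyThread& thread, std::size_t tls_index, std::size_t offset) {
    const unsigned char* block = thread.tls_array[tls_index];
    int value = 0;
    std::memcpy(&value, block + offset, sizeof value);
    return value;
}

void store_tls_int(const ToyThread& thread, std::size_t tls_index,
                   std::size_t offset, int value) {
    std::memcpy(thread.tls_array[tls_index] + offset, &value, sizeof value);
}

// Assumed layout: `a` at offset 0, `i` at offset sizeof(int).
int toy_square(const ToyThread& thread, std::size_t tls_index) {
    return load_tls_int(thread, tls_index, 0) *
           load_tls_int(thread, tls_index, sizeof(int));
}

void check_toy_model() {
    const int initial_image[2] = {5, 5};          // a = 5, i = 5
    std::vector<unsigned char> block_a(sizeof initial_image);
    std::vector<unsigned char> block_b(sizeof initial_image);
    std::memcpy(block_a.data(), initial_image, sizeof initial_image);
    std::memcpy(block_b.data(), initial_image, sizeof initial_image);

    const std::size_t tls_index = 0;              // one index for the whole module
    const ToyThread thread_a{{block_a.data()}};
    const ToyThread thread_b{{block_b.data()}};

    store_tls_int(thread_a, tls_index, sizeof(int), 9);  // thread A changes its own i

    assert(toy_square(thread_a, tls_index) == 45);       // 5 * 9
    assert(toy_square(thread_b, tls_index) == 25);       // thread B still sees 5 * 5
}
```

Both variables share one tls_index and differ only in their offset, which is the pattern visible in the disassembly.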

I believe that mainCRTStartup, which the linker always includes in the final executable and makes the entry point when linking a console application, references the _tls_used variable (because every .exe needs its own index). _tls_used is placed, via a pragma, in the T fragment of .rdata in whichever object file within libcmt.lib defines it, and because mainCRTStartup references it, the linker will include it in the final executable. If the linker finds a reference to a _tls_used variable, it will make sure to include it and to point the PE header TLS directory at it.

#pragma section(".rdata$T", long, read)    //creates a read only section called `.rdata` if not created and a fragment T in the section
#define _CRTALLOC(x) __declspec(allocate(x))
#pragma data_seg()   //set the compiler's current default data section to `.data`

_CRTALLOC(".rdata$T")  //place in the section .rdata, fragment T
const IMAGE_TLS_DIRECTORY _tls_used =
{
 (ULONG)(ULONG_PTR) &_tls_start, // start of tls data in the tls section
 (ULONG)(ULONG_PTR) &_tls_end,   // end of tls data
 (ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
 (ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
 (ULONG) 0,                      // size of tls zero fill
 (ULONG) 0                       // characteristics
};

http://www.nynaeve.net/?p=183

_tls_used is a variable of type IMAGE_TLS_DIRECTORY with the above initialised content, and it is actually defined in tlssup.c. Before that, the file defines _tls_index, _tls_start and _tls_end, placing _tls_start at the start of the .tls section and _tls_end at the end of it by putting it in the fragment ZZZ, so that it sorts alphabetically to the end of the section:

#pragma data_seg(".tls") //set the compiler's current default data section to `.tls`

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls")   //place the following in the section named `.tls`
#endif
char _tls_start = 0;   //if not defined, place in the current default data section, which is also `.tls`

#pragma data_seg(".tls$ZZZ")

#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;

The addresses of these are then used as markers in the _tls_used TLS directory. The linker can resolve them only once the .tls section is complete and its contents have fixed relative locations.

GCC (TLS is directly before FS base; raw data rather than pointers)

 mov    edx,DWORD PTR fs:0xfffffffffffffff8 //access thread_local int1 inside function
 mov    eax,DWORD PTR fs:0xfffffffffffffffc //access thread_local int2 inside function

Making one, both or none of the variables local produces identical code.

When the thread terminates, the thread library on Windows will deallocate the storage using TlsFree() calls (it must also deallocate the heap memory pointed to by the pointer returned by TlsGetValue()).

Upvotes: 4

Philipp Claßen

Reputation: 43940

First note that a block-local thread-local is implicitly static thread_local. In other words, your example code is equivalent to:

int main()
{
    static thread_local int n {42};
    std::thread t(My::f, &n);
    t.join();
    std::cout << n << "\n"; // prints 43
    return 0;
}

Variables declared with thread_local inside a function are not so different from globally defined thread_locals. In both cases, you create an object that is unique per thread and whose lifetime is bound to the lifetime of the thread.

The main difference is when initialization happens. Globally defined thread_locals are typically initialized when the new thread starts, before you enter any thread-specific functions (an implementation may defer this, but never past the first use on that thread). In contrast, a block-local thread-local variable is initialized the first time control passes through its declaration on that thread.
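
Here is a small sketch of the block-scope half of that rule. Tracker, touch and check_lazy_initialisation are hypothetical names used only for this illustration; the counter simply records how often the constructor ran.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> constructions{0};   // how many Tracker objects were constructed

struct Tracker {
    Tracker() { ++constructions; }
};

void touch() {
    thread_local Tracker tracker;    // constructed on the first pass, once per thread
    (void)tracker;
}

// Hypothetical driver, for illustration only.
void check_lazy_initialisation() {
    const int before = constructions.load();

    std::thread idle([] { /* never reaches the declaration */ });
    idle.join();
    const int after_idle = constructions.load();

    std::thread busy([] { touch(); touch(); touch(); });
    busy.join();
    const int after_busy = constructions.load();

    assert(after_idle == before);        // no call, no construction
    assert(after_busy == before + 1);    // three calls, one construction
}
```

A thread that never calls touch() pays nothing for the variable's constructor, which is one practical reason to prefer block scope for expensive objects.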

A use case would be to speed up a function by defining a local cache that is reused during the lifetime of the thread:

void foo() {
  static thread_local MyCache cache;
  // ...
}

(I used static thread_local here to make it explicit that the cache will be reused if the function is executed multiple times within the same thread, but it is a matter of taste. If you drop the static, it will not make any difference.)
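
To flesh out the cache idea a little, here is one possible sketch. MyCache is replaced by a plain std::unordered_map, and slow_square, Lookup, cached_square and check_cache are hypothetical names; the point is only that no locking is needed because every thread owns its map.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <unordered_map>

// Stand-in for an expensive pure function (hypothetical workload).
std::uint64_t slow_square(std::uint64_t x) { return x * x; }

struct Lookup {
    std::uint64_t value;
    bool cache_hit;
};

// Memoises slow_square per thread; no mutex is required because each
// thread only ever touches its own map.
Lookup cached_square(std::uint64_t x) {
    thread_local std::unordered_map<std::uint64_t, std::uint64_t> cache;
    const auto it = cache.find(x);
    if (it != cache.end()) {
        return {it->second, true};
    }
    const std::uint64_t value = slow_square(x);
    cache.emplace(x, value);
    return {value, false};
}

// Hypothetical check, for illustration only.
void check_cache() {
    cached_square(12);                          // warm this thread's cache
    const Lookup again = cached_square(12);

    Lookup on_worker{0, true};
    std::thread worker([&on_worker] { on_worker = cached_square(12); });
    worker.join();

    assert(again.value == 144 && again.cache_hit);           // reused on this thread
    assert(on_worker.value == 144 && !on_worker.cache_hit);  // the worker started empty
}
```

The trade-off is memory: every thread that calls the function builds and keeps its own cache until it exits.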


A comment about your example code: maybe it was intentional, but the new thread is not really accessing its own thread_local n. Instead, it operates on a copy of a pointer that was created by the thread running main. Because of that, both threads refer to the same memory.

In other words, a more verbose way would have been:

int main()
{
    thread_local int n {42};
    int* n_ = &n;
    std::thread t(My::f, n_);
    t.join();
    std::cout << n << "\n"; // prints 43
    return 0;
}

If you change the code so that the thread accesses n by name, it will operate on its own copy, and the n belonging to the main thread will not be modified:

int main()
{
    thread_local int n {42};
    std::thread t([&] { My::f(&n); });
    t.join();
    std::cout << n << "\n"; // prints 42 (not 43)
    return 0;
}

Here is a more complicated example. It calls the function two times to show that the state is preserved between the calls. Also its output shows that the threads operate on their own state:

#include <iostream>
#include <thread>

void foo() {
  thread_local int n = 1;
  std::cout << "n=" << n << " (main)" << std::endl;
  n = 100;
  std::cout << "n=" << n << " (main)" << std::endl;
  int& n_ = n;
  std::thread t([&] {
          std::cout << "t executing...\n";
          std::cout << "n=" << n << " (thread 1)\n";
          std::cout << "n_=" << n_ << " (thread 1)\n";
          n += 1;
          std::cout << "n=" << n << " (thread 1)\n";
          std::cout << "n_=" << n_ << " (thread 1)\n";
          std::cout << "t executing...DONE" << std::endl;
        });
  t.join();
  std::cout << "n=" << n << " (main, after t.join())\n";
  n = 200;
  std::cout << "n=" << n << " (main)" << std::endl;

  std::thread t2([&] {
          std::cout << "t2 executing...\n";
          std::cout << "n=" << n << " (thread 2)\n";
          std::cout << "n_=" << n_ << " (thread 2)\n";
          n += 1;
          std::cout << "n=" << n << " (thread 2)\n";
          std::cout << "n_=" << n_ << " (thread 2)\n";
          std::cout << "t2 executing...DONE" << std::endl;
        });
  t2.join();
  std::cout << "n=" << n << " (main, after t2.join())" << std::endl;
}

int main() {
  foo();
  std::cout << "---\n";
  foo();
  return 0;
}

Output:

n=1 (main)
n=100 (main)
t executing...
n=1 (thread 1)      # the thread used the "n = 1" init code
n_=100 (thread 1)   # the passed reference, not the thread_local
n=2 (thread 1)      # write to the thread_local
n_=100 (thread 1)   # did not change the passed reference
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())
---
n=200 (main)        # second execution: old state is reused
n=100 (main)
t executing...
n=1 (thread 1)
n_=100 (thread 1)
n=2 (thread 1)
n_=100 (thread 1)
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())

Upvotes: 2

Cruz Jean

Reputation: 2819

I find thread_local is only useful in three cases:

  1. If you need each thread to have a unique resource so that they don't have to share, mutex, etc. for using said resource. And even so, this is only useful if the resource is large and/or expensive to create or needs to persist across function invocations (i.e. a local variable inside the function will not suffice).

  2. An offshoot of (1) - you may need special logic to run when a calling thread eventually terminates. For this, you can use the destructor of the thread_local object created in the function. The destructor of such a thread_local object is called once for each thread that entered the code block with the thread_local declaration (at the end of the thread's lifetime).

  3. You may need some other logic to be performed for each unique thread that calls it, but only once. For instance, you could write a function that registers each unique thread that called a function. This may sound bizarre, but I've found uses for this in managing garbage-collected resources in a library I'm developing. This usage is closely related to (1), but the object doesn't get used after its construction. It is effectively a sentry object for a thread's entire lifetime.
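
Cases (2) and (3) can be sketched together. This is not the library mentioned above, just a minimal illustration with made-up names (ThreadSentry, library_entry_point, check_sentry): the constructor registers the calling thread once, and the destructor runs when that thread exits.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> live_threads{0};       // threads currently registered
std::atomic<int> total_registered{0};   // registrations ever made

// Hypothetical sentry: registers on construction, deregisters at thread exit.
struct ThreadSentry {
    ThreadSentry()  { ++live_threads; ++total_registered; }
    ~ThreadSentry() { --live_threads; }
};

void library_entry_point() {
    thread_local ThreadSentry sentry;   // constructed once per calling thread
    (void)sentry;
    // ... the real work of the function would go here ...
}

// Hypothetical check, for illustration only.
void check_sentry() {
    const int live_before = live_threads.load();
    const int total_before = total_registered.load();

    int live_inside = -1;
    std::thread worker([&] {
        library_entry_point();
        library_entry_point();          // the second call does not register again
        live_inside = live_threads.load();
    });
    worker.join();                      // the worker's sentry has been destroyed by now

    assert(live_inside == live_before + 1);
    assert(total_registered.load() == total_before + 1);
    assert(live_threads.load() == live_before);
}
```

Threads that never enter library_entry_point are never registered, so the bookkeeping tracks actual users of the function rather than every thread in the process.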

Upvotes: 5

Michael Kenzel

Reputation: 15933

Putting aside the great examples already given by Cruz Jean (I don't think I could add to those), consider also the following: there's no reason to forbid it. I don't think you doubt the usefulness of thread_local or question why it should be in the language in general. There is a well-defined meaning for a thread_local block-scope variable simply as a result of how storage classes and scopes work in C++.

Just because one cannot think of something "interesting" to do with every possible combination of language features doesn't mean that all combinations of language features that don't have at least one known "interesting" application must explicitly be disallowed. By that logic, we'd also have to go ahead and disallow classes with no private members from having friends and whatnot.

At least to me, C++ in particular seems to follow a philosophy of "if there's no specific technical reason why feature X cannot work in situation Y, then there's no reason to forbid it", which I would consider a quite healthy approach. Forbidding things for no good reason means adding complexity for no good reason. And I believe everyone would agree that there's already enough complexity in C++. It also prevents happy accidents like when, only after many years, a certain language feature is suddenly discovered to have previously unthought-of applications. The most prominent example of such a case would probably be templates which (at least as far as I'm aware) were not originally conceived with the purpose of metaprogramming in mind; it just turned out later that they could also be used for that…

Upvotes: 1
