Reputation: 14424
What is the use of a thread_local variable at block scope?
If a compilable sample helps to illustrate the question, here it is:
#include <thread>
#include <iostream>
namespace My {
void f(int *const p) {++*p;}
}
int main()
{
thread_local int n {42};
std::thread t(My::f, &n);
t.join();
std::cout << n << "\n";
return 0;
}
Output: 43
In the sample, the new thread gets its own n but (as far as I know) can do nothing interesting with it, so why bother? Does the new thread's own n have any use? And if it has no use, then what is the point?
Naturally, I assume that there is a point. I just do not know what the point might be. This is why I ask.
If the new thread's own n needs (as I suppose) special handling by the CPU at runtime, perhaps because, at the machine-code level, it cannot be accessed in the normal way via a precalculated offset from the base pointer of the new thread's stack, then are we not merely wasting machine cycles and electricity for no gain? And even if special handling were not required, there would still be no gain that I can see.
So why thread_local at block scope, please?
References
thread_local and other storage classes
Upvotes: 8
Views: 2458
Reputation: 4677
static thread_local and thread_local at block scope are equivalent. A thread_local variable has thread storage duration, not static or automatic, so adding static at block scope does not change its storage duration; the standard simply treats static as implied when thread_local appears at block scope without another storage-class specifier. static does not modify the linkage at block scope either (such a variable always has no linkage), so there it has no meaning beyond storage duration. extern thread_local is also possible at block scope.
static thread_local at file scope gives the thread_local variable internal linkage, which means there is one copy per translation unit in the TLS: each translation unit resolves to its own variable at the TLS index for the .exe, because the assembler places the variable in the rdata$t section of the .o file and marks it in the symbol table as a local symbol, since the symbol has no .global directive. extern thread_local at file scope is legal, as it is at block scope, and uses the thread_local copy defined in another translation unit. thread_local at file scope is not implicitly static, because it can provide a global symbol definition for another translation unit, which a block-scope variable cannot do.
The compiler stores all initialised thread_local variables (including block-scope ones) in .tdata and uninitialised ones in .tbss for ELF, or all of them in .tls for the PE format. I presume that the thread library, when creating a thread, reads the .tls segment and performs Windows API calls (TlsAlloc and TlsSetValue), which allocate the variables for each .exe and .dll on the heap, place a pointer in the TLS array of the thread's TEB in the GS segment and return the allocated index; it also calls the DLL_THREAD_ATTACH routines of dynamic libraries. Presumably, a pointer to a value in the space delimited by _tls_start and _tls_end is what is passed to TlsSetValue as the value pointer.
The difference between file-scope static/extern thread_local and block-scope (extern) thread_local is the same as the general difference between file-scope static/extern and block-scope static/extern: the name of the block-scope thread_local variable goes out of scope at the end of the block it is declared in, although the object can still be returned and accessed by address because of its thread storage duration.
The compiler knows the index of the data in the .tls segment, so it can emit accesses through the GS segment directly, as can be seen on godbolt.
MSVC
thread_local int a = 5;
int square(int num) {
thread_local int i = 5;
return a * i;
}
_TLS SEGMENT
int a DD 05H ; a
_TLS ENDS
_TLS SEGMENT
int `int square(int)'::`2'::i DD 05H ; `square'::`2'::i
_TLS ENDS
num$ = 8
int square(int) PROC ; square
mov DWORD PTR [rsp+8], ecx
mov eax, OFFSET FLAT:int a ; a
mov eax, eax
mov ecx, DWORD PTR _tls_index
mov rdx, QWORD PTR gs:88
mov rcx, QWORD PTR [rdx+rcx*8]
mov edx, OFFSET FLAT:int `int square(int)'::`2'::i
mov edx, edx
mov r8d, DWORD PTR _tls_index
mov r9, QWORD PTR gs:88
mov r8, QWORD PTR [r9+r8*8]
mov eax, DWORD PTR [rcx+rax]
imul eax, DWORD PTR [r8+rdx]
ret 0
int square(int) ENDP ; square
This loads a 64-bit pointer from gs:88 (gs:[0x58], which is the linear address of the thread-local storage array), then loads a 64-bit pointer from the TLS array pointer + _tls_index*8 (the index into the array multiplied by the pointer size). a is then loaded from this pointer plus its offset into the .tls segment. Since both variables use the same _tls_index, there appears to be one index per .exe, i.e. per .tls section; indeed there is one _tls_index per TLS directory in .rdata, and the variables are packed together at the address pointed to by the TLS array entry. static thread_local variables in different translation units are merged into .tls and are all packed together at the same index.
I believe that mainCRTStartup, which the linker always includes in the final executable and makes the entry point when linking a console application, references the _tls_used variable (because every .exe needs its own index). _tls_used is placed by a pragma in the T fragment of .rdata in whichever object file within libcmt.lib defines it, and because mainCRTStartup references it, the linker includes it in the final executable. If the linker finds a reference to a _tls_used variable, it makes sure to include it and makes the PE header TLS directory point to it.
#pragma section(".rdata$T", long, read) //creates a read only section called `.rdata` if not created and a fragment T in the section
#define _CRTALLOC(x) __declspec(allocate(x))
#pragma data_seg() //set the compilers current default data section to `.data`
_CRTALLOC(".rdata$T") //place in the section .rdata, fragment T
const IMAGE_TLS_DIRECTORY _tls_used =
{
(ULONG)(ULONG_PTR) &_tls_start, // start of tls data in the tls section
(ULONG)(ULONG_PTR) &_tls_end, // end of tls data
(ULONG)(ULONG_PTR) &_tls_index, // address of tls_index
(ULONG)(ULONG_PTR) (&__xl_a+1), // pointer to callbacks
(ULONG) 0, // size of tls zero fill
(ULONG) 0 // characteristics
};
_tls_used is a variable of type IMAGE_TLS_DIRECTORY with the above initialised content, and it is actually defined in tlssup.c. Before that, the file defines _tls_index, _tls_start and _tls_end, placing _tls_start at the start of the .tls section and _tls_end at the end of the .tls section by putting it in the section fragment ZZZ, so that it sorts alphabetically to the end of the section:
#pragma data_seg(".tls") //set the compilers current default data section to `.tls`
#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls") //place the following in the section named `.tls`
#endif
char _tls_start = 0; //if not defined, place in the current default data section, which is also `.tls`
#pragma data_seg(".tls$ZZZ")
#if defined (_M_IA64) || defined (_M_AMD64)
_CRTALLOC(".tls$ZZZ")
#endif
char _tls_end = 0;
The addresses of these are then used as markers in the _tls_used TLS directory. The addresses are resolved by the linker only once the .tls section is complete and has a fixed relative lea location.
GCC (TLS is directly before FS base; raw data rather than pointers)
mov edx,DWORD PTR fs:0xfffffffffffffff8 //access thread_local int1 inside function
mov eax,DWORD PTR fs:0xfffffffffffffffc //access thread_local int2 inside function
Making one, both or none of the variables local produces identical code.
When the thread terminates, the thread library on Windows deallocates the storage using TlsFree() calls (it must also deallocate the heap memory pointed to by the pointer returned by TlsGetValue()).
Upvotes: 4
Reputation: 43940
First note that a block-local thread-local is implicitly static thread_local. In other words, your example code is equivalent to:
int main()
{
static thread_local int n {42};
std::thread t(My::f, &n);
t.join();
std::cout << n << "\n"; // prints 43
return 0;
}
Variables declared with thread_local inside a function are not so different from globally defined thread_locals. In both cases, you create an object that is unique per thread and whose lifetime is bound to the lifetime of the thread.
The difference is only that globally defined thread_locals will be initialized when the new thread is run before you enter any thread-specific functions. In contrast, a block-local thread-local variable is initialized the first time control passes through its declaration.
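The lazy, per-thread initialization of a block-scope thread_local can be observed with a small sketch. The names Tracker, touch and constructions_from_two_threads are hypothetical and introduced only for illustration.

```cpp
#include <atomic>
#include <thread>

// Counts how many Tracker objects have been constructed, across all threads.
std::atomic<int> constructions{0};

struct Tracker {
    Tracker() { ++constructions; }
};

void touch() {
    // Constructed the first time *this thread* passes the declaration,
    // then reused on later calls from the same thread.
    thread_local Tracker tracker;
    (void)tracker;
}

// Starts two threads: one calls touch() twice, the other never calls it.
// Returns how many Tracker objects those two threads constructed.
int constructions_from_two_threads() {
    const int before = constructions.load();
    std::thread caller([] { touch(); touch(); });
    std::thread idle([] { /* never reaches the declaration in touch() */ });
    caller.join();
    idle.join();
    return constructions.load() - before;
}
```

Only the thread that actually executes the declaration pays for the construction, and it pays once no matter how often it calls the function.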
A use case would be to speed up a function by defining a local cache that is reused during the lifetime of the thread:
void foo() {
static thread_local MyCache cache;
// ...
}
(I used static thread_local here to make it explicit that the cache will be reused if the function is executed multiple times within the same thread, but it is a matter of taste. If you drop the static, it will not make any difference.)
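As a concrete instance of the cache idea, here is a minimal sketch of a memoised function. MyCache above is left unspecified, so this sketch assumes a std::unordered_map as the cache and uses Fibonacci numbers purely as a stand-in for an expensive computation.

```cpp
#include <cstdint>
#include <unordered_map>

// Memoised Fibonacci with a per-thread cache. Because every thread has its
// own map, no mutex is needed, and results survive between calls made by
// the same thread.
std::uint64_t fib(unsigned n) {
    thread_local std::unordered_map<unsigned, std::uint64_t> cache;
    if (n < 2) {
        return n;
    }
    auto it = cache.find(n);
    if (it != cache.end()) {
        return it->second;   // cache hit: computed earlier on this thread
    }
    // Compute first, insert afterwards, so no iterator is held across
    // the recursive calls that may grow the map.
    const std::uint64_t result = fib(n - 1) + fib(n - 2);
    cache.emplace(n, result);
    return result;
}
```

The trade-off is memory: each thread that calls fib builds and keeps its own copy of the cache until that thread exits.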
A comment about your example code: maybe it was intentional, but the new thread is not really accessing its own thread_local n. Instead it operates on a copy of a pointer that was created by the thread running main. Because of that, both threads refer to the same memory.
In other words, a more verbose way would have been:
int main()
{
thread_local int n {42};
int* n_ = &n;
std::thread t(My::f, n_);
t.join();
std::cout << n << "\n"; // prints 43
return 0;
}
If you change the code so that the thread accesses n itself, it will operate on its own version, and the n belonging to the main thread will not be modified:
int main()
{
thread_local int n {42};
std::thread t([&] { My::f(&n); });
t.join();
std::cout << n << "\n"; // prints 42 (not 43)
return 0;
}
Here is a more complicated example. It calls the function two times to show that the state is preserved between the calls. Also its output shows that the threads operate on their own state:
#include <iostream>
#include <thread>
void foo() {
thread_local int n = 1;
std::cout << "n=" << n << " (main)" << std::endl;
n = 100;
std::cout << "n=" << n << " (main)" << std::endl;
int& n_ = n;
std::thread t([&] {
std::cout << "t executing...\n";
std::cout << "n=" << n << " (thread 1)\n";
std::cout << "n_=" << n_ << " (thread 1)\n";
n += 1;
std::cout << "n=" << n << " (thread 1)\n";
std::cout << "n_=" << n_ << " (thread 1)\n";
std::cout << "t executing...DONE" << std::endl;
});
t.join();
std::cout << "n=" << n << " (main, after t.join())\n";
n = 200;
std::cout << "n=" << n << " (main)" << std::endl;
std::thread t2([&] {
std::cout << "t2 executing...\n";
std::cout << "n=" << n << " (thread 2)\n";
std::cout << "n_=" << n_ << " (thread 2)\n";
n += 1;
std::cout << "n=" << n << " (thread 2)\n";
std::cout << "n_=" << n_ << " (thread 2)\n";
std::cout << "t2 executing...DONE" << std::endl;
});
t2.join();
std::cout << "n=" << n << " (main, after t2.join())" << std::endl;
}
int main() {
foo();
std::cout << "---\n";
foo();
return 0;
}
Output:
n=1 (main)
n=100 (main)
t executing...
n=1 (thread 1) # the thread used the "n = 1" init code
n_=100 (thread 1) # the passed reference, not the thread_local
n=2 (thread 1) # write to the thread_local
n_=100 (thread 1) # did not change the passed reference
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())
---
n=200 (main) # second execution: old state is reused
n=100 (main)
t executing...
n=1 (thread 1)
n_=100 (thread 1)
n=2 (thread 1)
n_=100 (thread 1)
t executing...DONE
n=100 (main, after t.join())
n=200 (main)
t2 executing...
n=1 (thread 2)
n_=200 (thread 2)
n=2 (thread 2)
n_=200 (thread 2)
t2 executing...DONE
n=200 (main, after t2.join())
Upvotes: 2
Reputation: 2819
I find thread_local is only useful in three cases:
1. If you need each thread to have a unique resource so that threads don't have to share it, lock a mutex, etc. in order to use it. Even so, this is only useful if the resource is large and/or expensive to create, or needs to persist across function invocations (i.e. a local variable inside the function will not suffice).
2. An offshoot of (1): you may need special logic to run when a calling thread eventually terminates. For this, you can use the destructor of the thread_local object created in the function. The destructor of such a thread_local object is called once for each thread that entered the code block with the thread_local declaration (at the end of the thread's lifetime).
3. You may need some other logic to be performed once for each unique thread that calls a function. For instance, you could write a function that registers each unique thread that called it. This may sound bizarre, but I've found uses for this in managing garbage-collected resources in a library I'm developing. This usage is closely related to (1), but the object doesn't get used after its construction. It is effectively a sentry object for a thread's entire lifetime.
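Cases (2) and (3) can be sketched together as a sentry object. This is only an illustration under assumptions: the names ThreadSentry, library_call, run_workers and the two counters are hypothetical, and the real library mentioned above is not shown.

```cpp
#include <mutex>
#include <thread>

std::mutex registry_mutex;
int live_threads = 0;      // threads currently registered
int total_registered = 0;  // threads that have ever registered

// Sentry: the constructor registers the calling thread once,
// the destructor runs when that thread exits.
struct ThreadSentry {
    ThreadSentry() {
        std::lock_guard<std::mutex> lock(registry_mutex);
        ++live_threads;
        ++total_registered;
    }
    ~ThreadSentry() {
        std::lock_guard<std::mutex> lock(registry_mutex);
        --live_threads;
    }
};

void library_call() {
    thread_local ThreadSentry sentry;  // one per thread that ever gets here
    (void)sentry;
    // ... actual work would go here ...
}

// Runs `count` worker threads one after another; each calls library_call()
// twice. Returns how many registrations those workers caused.
int run_workers(int count) {
    int before = 0;
    {
        std::lock_guard<std::mutex> lock(registry_mutex);
        before = total_registered;
    }
    for (int i = 0; i < count; ++i) {
        std::thread worker([] { library_call(); library_call(); });
        worker.join();  // the sentry's destructor runs as the worker exits
    }
    std::lock_guard<std::mutex> lock(registry_mutex);
    return total_registered - before;
}
```

Each worker registers exactly once despite calling the function twice, and once all workers have been joined the live count is back where it started.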
Upvotes: 5
Reputation: 15933
Putting aside the great examples already given by Cruz Jean (I don't think I could add to those), consider also the following: there's no reason to forbid it. I don't think you doubt the usefulness of thread_local or question why it should be in the language in general. There is a well-defined meaning for a thread_local block-scope variable simply as a result of how storage classes and scopes work in C++.
Just because one cannot think of something "interesting" to do with every possible combination of language features doesn't mean that all combinations without at least one known "interesting" application must explicitly be disallowed. By that logic, we'd also have to go ahead and disallow classes with no private members from having friends and whatnot. At least to me, C++ in particular seems to follow a philosophy of "if there's no specific technical reason why feature X cannot work in situation Y, then there's no reason to forbid it", which I would consider a quite healthy approach.
Forbidding things for no good reason means adding complexity for no good reason, and I believe everyone would agree that there's already enough complexity in C++. Forbidding things also prevents happy accidents, as when, only after many years, a certain language feature is suddenly discovered to have previously unthought-of applications. The most prominent example of such a case would probably be templates, which (at least as far as I'm aware) were not originally conceived with metaprogramming in mind; it just turned out later that they could also be used for that…
Upvotes: 1