NoSenseEtAl

Reputation: 30138

How does memory usage of thread_local scale with number of threads?

I presume the C and C++ standards say nothing about complexity here, so I am curious about specific implementations (I presume they all behave the same way).

Assume I have the following C++ function.

void fn() {
    thread_local char arr[1024*1024]{};
    // do something with arr
}

My program has 80 threads, 47 of which run fn() at least once.

Does the memory usage of my program grow by around 47 times some constant, 80 times some constant, or is there some other formula for this?

Note: there is this Java question that got closed for some reason, but I don't know if Java uses the same primitives as C/C++.

Upvotes: 2

Views: 1201

Answers (2)

Alan Birtles

Reputation: 36488

This is likely to be largely implementation-dependent, though you can verify the behaviour of your implementation fairly easily. For example, running the following program on Windows (using a debug Visual Studio build so that the optimiser does not remove the unused code):

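#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>

struct Foo
{
    std::array<char, 1'000'000'000> data;  // roughly 1 GB per instance
};

void bar()
{
    thread_local Foo foo;
    // Touch every byte so the storage is actually used
    for (std::size_t i = 0; i < foo.data.size(); i++)
    {
        foo.data[i] = static_cast<char>(i);
    }
    // Keep the thread alive so memory usage can be inspected
    std::this_thread::sleep_for(std::chrono::seconds(1000));
}

int main()
{
    // thread1 uses the thread_local variable
    std::thread thread1([]
    {
        bar();
    });

    // thread2 never calls bar()
    std::thread thread2([]
    {
        std::this_thread::sleep_for(std::chrono::seconds(1000));
    });

    thread1.join();
    thread2.join();
}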

This uses 3 GB of memory: 1 GB for each of the two threads and 1 GB for the main thread. Removing thread2 drops the memory usage to 2 GB. On Linux the behaviour is likely to be different, as it overcommits memory and pages are not physically allocated until they are first used.

You can avoid this by using smart pointers to allocate the memory only when it's actually used, for example by changing bar to:

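// requires #include <memory>
void bar()
{
    // Only the pointer is thread-local; the Foo is heap-allocated on the first call in each thread
    thread_local std::unique_ptr<Foo> foo = std::make_unique<Foo>();
    for (std::size_t i = 0; i < foo->data.size(); i++)
    {
        foo->data[i] = static_cast<char>(i);
    }
    std::this_thread::sleep_for(std::chrono::seconds(1000));
}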

This reduces the memory usage to 1 GB, as only thread1 actually allocates the large array; thread2 and the main thread only have to store the unique_ptr.
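Applied to the numbers in the question, the same idea can be sketched as follows. This is only an illustration under stated assumptions: g_bytes_allocated and bytes_allocated_by are hypothetical names introduced here for counting, and the counter tracks bytes requested from the heap by fn(), not what the operating system reports for the process.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

constexpr std::size_t kBufferSize = 1024 * 1024;  // 1 MiB, as in the question

// Hypothetical counter: total bytes requested by fn() across all threads.
std::atomic<std::size_t> g_bytes_allocated{0};

void fn()
{
    // Only the pointer lives in thread-local storage; the array itself is
    // heap-allocated the first time a given thread reaches this line.
    thread_local std::unique_ptr<char[]> arr = [] {
        g_bytes_allocated += kBufferSize;
        return std::make_unique<char[]>(kBufferSize);
    }();
    arr[0] = 1;  // do something with arr
}

// Starts `total` threads, of which the first `callers` call fn(),
// and returns the number of bytes those calls requested.
std::size_t bytes_allocated_by(int total, int callers)
{
    g_bytes_allocated = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i)
    {
        threads.emplace_back([i, callers] {
            if (i < callers)
            {
                fn();
            }
        });
    }
    for (auto& t : threads)
    {
        t.join();
    }
    return g_bytes_allocated.load();
}
```

With this layout, bytes_allocated_by(80, 47) counts 47 buffers of 1 MiB, so the large allocation scales with the number of threads that call fn() rather than with the total number of threads. Each thread still carries the small thread-local pointer.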

Upvotes: 3

Galik

Reputation: 48645

According to the C++11 Standard:

3.7.2 Thread storage duration [ basic.stc.thread ]

1 All variables declared with the thread_local keyword have thread storage duration. The storage for these entities shall last for the duration of the thread in which they are created. There is a distinct object or reference per thread, and use of the declared name refers to the entity associated with the current thread.

2 A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.

It says, "The storage for these entities shall last for the duration of the thread in which they are created." So, to my reading, the memory must be allocated for all of the threads.

However, they are only initialized and destroyed if they are used: "A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit."
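The initialization half of this can be observed with a small sketch. It assumes a block-scope thread_local of a type with a non-trivial constructor and destructor; Tracked, touch, Counts and run_threads are hypothetical names used only for this illustration, and the sketch says nothing about when the underlying storage is reserved.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical counters for constructor and destructor calls.
std::atomic<int> g_constructed{0};
std::atomic<int> g_destroyed{0};

struct Tracked
{
    Tracked() { ++g_constructed; }
    ~Tracked() { ++g_destroyed; }
    int value = 0;
};

void touch()
{
    // Constructed the first time the calling thread passes this declaration.
    thread_local Tracked t;
    ++t.value;
}

struct Counts
{
    int constructed;
    int destroyed;
};

// Starts `total` threads, of which the first `callers` call touch() twice,
// and reports how many Tracked objects were constructed and destroyed.
Counts run_threads(int total, int callers)
{
    g_constructed = 0;
    g_destroyed = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < total; ++i)
    {
        threads.emplace_back([i, callers] {
            if (i < callers)
            {
                touch();
                touch();  // second call reuses the same per-thread object
            }
        });
    }
    for (auto& t : threads)
    {
        t.join();  // thread-local destructors run at thread exit
    }
    return Counts{g_constructed.load(), g_destroyed.load()};
}
```

Calling run_threads(8, 3) should report three constructions and three destructions: threads that never call touch() never construct their copy, and calling it twice in one thread constructs only one object.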

Upvotes: 5
