Tikoloche

Reputation: 371

Understanding Thrust (CUDA) memory usage

I'm using the CUDA Thrust library to do some Monte Carlo simulations. This works very well up to a certain number of simulations, at which point I get a bad_alloc exception. That seems reasonable, because an increasing number of simulations in my code means handling increasingly large device_vectors, so I expect this kind of exception to show up at some point.

What I'd like to do now is set an upper limit on this number of simulations, based on the available memory on my GPU, and then split the workload into batches of simulations.
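Roughly, the batching loop I have in mind looks like this (just a sketch; bytesPerSim and the 0.8 safety margin are placeholders of mine):

#include <algorithm>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Sketch: split nTotal simulations into batches sized from the free device
// memory. bytesPerSim is a placeholder for whatever one simulation needs.
void runInBatches(size_t nTotal, size_t bytesPerSim)
{
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);

    // Keep a safety margin: the driver reserves memory and allocations get
    // rounded up to some page granularity, so don't plan on all of freeMem.
    size_t usable = static_cast<size_t>(freeMem * 0.8);
    size_t batch = std::max<size_t>(1, usable / bytesPerSim);

    for (size_t done = 0; done < nTotal; done += batch) {
        size_t n = std::min(batch, nTotal - done);
        thrust::device_vector<float> work(n); // per-batch device storage
        // ... run n simulations using work ...
    }
}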

So I've been trying to size my problem before launching my set of simulations. Unfortunately, when I try to understand the way the memory is managed with simple examples, I get surprising results.

Here is an example of code I have been testing:

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

int main() 
{
    size_t freeMem, totalMem;

    cudaDeviceReset();
    cudaSetDevice(0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "Total Memory | Free Memory "<< std::endl;
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1k(1000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec100k(100000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1M(1000000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    return 0;
}

And here are the results I get:

Total Memory | Free Memory
2147483648, 2080542720
2147483648, 2079494144
2147483648, 2078445568
2147483648, 2074382336

So, basically, reading off the free-memory deltas:

- the 1,000-element vector costs 1,048,576 bytes (1 MiB),
- the 100,000-element vector also costs 1,048,576 bytes (1 MiB),
- the 1,000,000-element vector costs 4,063,232 bytes (about 3.875 MiB).

I would have expected the memory usage to scale roughly with the number of elements, but I get a "4x" between the last two vectors when I expected a "10x", and the relationship does not hold at all between 1,000 and 100,000 elements.

So, my 2 questions are:

1. Why doesn't the allocated memory scale with the number of elements the way I expected here?
2. How can I reliably estimate the device memory a given number of simulations will need, so I can size my batches from the memory reported as free on the GPU?

Edit

Following Mai Longdong's comment, I tried with two vectors, one of 262,144 floats (exactly 1 MiB at 4 bytes each) and the other of 262,145. Unfortunately, things don't look like a straight "per 1 MB page" allocation:

The delta between the two is 131072 bytes (128 KB). Could the page size be variable? Does that make sense?
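For reference, this is the kind of probe I used to measure those deltas (a minimal sketch; the helper name is mine):

#include <iostream>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Sketch: how much free memory does one allocation of n floats consume?
size_t measureAlloc(size_t n)
{
    size_t freeBefore, freeAfter, totalMem;
    cudaMemGetInfo(&freeBefore, &totalMem);
    {
        thrust::device_vector<float> v(n);
        cudaMemGetInfo(&freeAfter, &totalMem);
    } // v is freed when this scope closes
    return freeBefore - freeAfter;
}

int main()
{
    std::cout << measureAlloc(262144) << std::endl; // exactly 1 MiB of payload
    std::cout << measureAlloc(262145) << std::endl; // 1 MiB + 4 bytes
    return 0;
}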

Upvotes: 3

Views: 2355

Answers (1)

talonmies

Reputation: 72349

Thrust doesn't do anything magic with memory management; the default allocator just calls cudaMalloc, and what you are seeing is the driver memory manager's page size selection algorithm at work. This isn't documented, and there is no indication that the behaviour is consistent between platforms or hardware versions.
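You can check the cudaMalloc part yourself: a raw cudaMalloc of the same payload should cost the same amount of free memory as the device_vector did (a quick sketch):

#include <iostream>
#include <cuda_runtime.h>

int main()
{
    size_t freeBefore, freeAfter, total;
    cudaFree(0); // force context creation before measuring
    cudaMemGetInfo(&freeBefore, &total);

    float *p = 0;
    cudaMalloc((void **)&p, 1000 * sizeof(float)); // same payload as vec1k
    cudaMemGetInfo(&freeAfter, &total);

    std::cout << (freeBefore - freeAfter) << std::endl; // expect 1048576 here too
    cudaFree(p);
    return 0;
}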

That said, if I expand your code into something a bit more useful:

#include <iostream>
#include <vector>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

void report_mem(size_t allocd, bool first=false)
{
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (first) 
        std::cout << "Allocated | Total Memory | Free Memory "<< std::endl;
    std::cout << allocd << ", " << totalMem << ", " << freeMem << std::endl;
}

int main() 
{
    cudaSetDevice(0);

    report_mem(0, true);
    std::vector<size_t> asizes;
    const int nallocs = 10;
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<14);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<16);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<18);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<20);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<22);

    typedef thrust::device_vector<float> dvecf_t;
    std::vector<dvecf_t*> allocs;
    for(auto it = asizes.begin(); it != asizes.end(); ++it) {
        dvecf_t* v = new dvecf_t(*it);
        allocs.push_back(v);
        report_mem(v->capacity() * sizeof(float));
    }
    return 0;
}

and run it on a compute capability 2.1 device on 64-bit Windows, I get this:

Allocated | Total Memory | Free Memory 
0, 1073741824, 1007849472
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1003655168
262144, 1073741824, 1003655168
1048576, 1073741824, 1002606592
1048576, 1073741824, 1001558016
1048576, 1073741824, 1000509440
1048576, 1073741824, 999460864
1048576, 1073741824, 998412288
1048576, 1073741824, 997363712
1048576, 1073741824, 996315136
1048576, 1073741824, 995266560
1048576, 1073741824, 994217984
1048576, 1073741824, 993169408
4194304, 1073741824, 988975104
4194304, 1073741824, 984780800
4194304, 1073741824, 980586496
4194304, 1073741824, 976392192
4194304, 1073741824, 972197888
4194304, 1073741824, 968003584
4194304, 1073741824, 963809280
4194304, 1073741824, 959614976
4194304, 1073741824, 955420672
4194304, 1073741824, 951226368
16777216, 1073741824, 934449152
16777216, 1073741824, 917671936
16777216, 1073741824, 900894720
16777216, 1073741824, 884117504
16777216, 1073741824, 867340288
16777216, 1073741824, 850563072
16777216, 1073741824, 833785856
16777216, 1073741824, 817008640
16777216, 1073741824, 800231424

which I interpret as indicating that the allocation granularity is 1MiB (1048576 or 2^20 bytes) on the platform I tested this on. Your platform might be different.
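The practical consequence is that the real footprint of an allocation is the requested size rounded up to that granularity, so any sizing estimate should do the same. A sketch (the 1 MiB default is just what I measured above, not a documented constant):

#include <cstddef>

// Sketch: round a requested byte count up to the allocation granularity.
// The 1 MiB default is what I measured on this platform; measure your own.
size_t paddedSize(size_t bytes, size_t granularity = 1 << 20)
{
    return ((bytes + granularity - 1) / granularity) * granularity;
}

// e.g. paddedSize(1000 * sizeof(float)) == 1048576, matching the question's
// observation that a 1000 element float vector costs a full 1 MiB.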

Upvotes: 2
