chasep255
chasep255

Reputation: 12185

Threads seem to be slowing down image processing C++11

I am writing a function to change the values of the pixels in an image. The way it works is by splitting up the task of shading each pixel into multiple threads. For example if there are 4 threads then each one will shade every 4 pixels. What I find strange is that the threaded approach is about a 1/10 of a second slower than doing it in a single loop. I can't figure out why this is since I have a quad core CPU and there is no real synchronization involved between the threads. I would expect it to be about 4x faster minus a bit of overhead. Am I doing something wrong here?

Note that I set nthreads=1 to measure the single loop approach.

FYI raster is a pointer in the class which points to a dynamic array of pixels.

void RGBImage::shade(Shader sh, size_t sx, size_t sy, size_t ex, size_t ey)
{
    validate();
    if(ex == 0)
        ex = width;
    if(ey == 0)
        ey = height;

    if(sx < 0 || sx >= width || sx >= ex || ex > width || sy < 0 || sy >= height || sy >= ey
            || ey > height)
        throw std::invalid_argument("Bounds Invalid");

    size_t w = ex - sx;
    size_t h = ey - sy;
    size_t nthreads = std::thread::hardware_concurrency();
    if(nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    else if(nthreads < 1)
        nthreads = 1;

    size_t load_per_thread = w * h / nthreads;
    if(load_per_thread < MIN_THREAD_LOAD)
        nthreads = (w * h) / MIN_THREAD_LOAD;

    clock_t start = clock();
    if(nthreads > 1)
    {
        std::unique_ptr<std::thread[]> threads(new std::thread[nthreads]);
        for(size_t i = 0; i < nthreads; i++)
            threads[i] = std::thread([=]()
            {   
                for(size_t p = i; p < (w * h); p += nthreads)
                {   
                    size_t x = sx + p % w;
                    size_t y = sy + p / w;
                    sh(raster[y * width + x], x, y);
                }
            });
        for(size_t i = 0; i < nthreads; i++)
            threads[i].join();
    }
    else
    {
        for(size_t p = 0; p < (w * h); ++p)
        {
            size_t x = sx + p % w;
            size_t y = sy + p / w;
            sh(raster[y * width + x], x, y);
        }
    }
    std::cout << ((float)(clock() - start) / CLOCKS_PER_SEC) << std::endl;
}

I took some of the advice ans changed up my function.

void RGBImage::shade(Shader sh, bool threads)
{
    validate();
    clock_t c = clock();
    if(threads)
    {
        int nthreads = std::thread::hardware_concurrency();
        size_t pix = width * height;
        if(nthreads < 1)
            nthreads = 1;
        else if(nthreads > MAX_THREADS)
            nthreads = MAX_THREADS;
        if(pix / nthreads < MIN_THREAD_LOAD)
            nthreads = pix / MIN_THREAD_LOAD;

        size_t pix_per_threads = pix / nthreads;

        std::unique_ptr<std::thread[]> t(new std::thread[nthreads]);
        for(int i = 0; i < nthreads; i++)
        {
            t[i] = std::thread([=]()
            {
                size_t offset = i * pix_per_threads;
                size_t x = offset % width;
                size_t y = offset / width;
                sh(raster + offset, *this, x, y, 
                        i == nthreads - 1 ? pix_per_threads + (width * height) % nthreads : pix_per_threads);
            });
        }
        for(int i = 0; i < nthreads; i++)
            t[i].join();
    }
    else
    {
        sh(raster, *this, 0, 0, width * height);
    }
    std::cout << ((float)(clock() - c) / CLOCKS_PER_SEC) << std::endl;
}

Now it runs about 10x faster but the threaded version is still slower.

Upvotes: 0

Views: 139

Answers (3)

chasep255
chasep255

Reputation: 12185

Well that answer is pretty simple. The threaded solution was infact faster. It was just consuming more clock() time and the clock() function is not good for timing threads.

Upvotes: 0

Yakk - Adam Nevraumont
Yakk - Adam Nevraumont

Reputation: 275878

What you have done is maximized contention between threads.

You want to minimize it.

Threads should work on a scanline at a time (or more). Divide your image into n blocks of roughly equal number of scanlines (left of image to right), and tell each thread to work on the nth block of scanlines.

std::vector<std::thread> threads;
threads.reserve(nthreads);
for(size_t i = 0; i < nthreads; i++) {
  size_t v_start = (h*i)/nthreads;
  size_t v_end = (h*(i+1))/nthreads;
  threads.push_back(std::thread([=]()
  {   
    for(size_t y = v_start; y < v_end; ++y)
    {   
      for (size_t x = 0; x < w; ++x) {
        sh(raster[y * width + x], x, y);
      }
    }
  }));
}
for(auto&& thread:threads)
  thread.join();

another approach is to grab the ppl (parallel patterns library) and use it. It will dynamically balance the number of threads based on the current load and hardware specs, and might use thread pooling to reduce thread startup costs.

A serious concern is your Shader sh. You do not want to be calling anything as expensive as a function pointer (or even more expensive, a std::function) on a per-pixel basis.

My general rule is that I write a "for each pixel" function that takes the pixel operation as a F&&, and pass it to a "for each scanline" function after wrapping the pixel-shader in an (in-header-file) scanline based operation. The cost of indirection is then reduced to once per scanline. In addition, the compiler may be able to optimize between pixel operations (say, doing SIMD), while a per-pixel call cannot be optimized this way.

A final problem with your "interleave" solution is it makes it impossible for the compiler to vectorize your code. Vectorization can easily give a 3-4x speedup.

Upvotes: 6

mg3
mg3

Reputation: 131

in C++ you can take advantage of those cores using parallelism or use amp (accelerated massive parallelism). My vote will be to do the latter.

sample amp project: http://austin.codeplex.com/

https://msdn.microsoft.com/en-us/library/hh265137.aspx http://blogs.msdn.com/b/nativeconcurrency/archive/2012/08/30/learn-c-amp.aspx

Upvotes: -2

Related Questions