newbie

Reputation: 21

Why does having more than one thread (parallel processing) in some specific cases degrade performance?

I noticed that having more than one thread running some code is much, much slower than having one thread, and I have really been pulling my hair out trying to figure out why. Can anyone help?

Code explanation: I sometimes have a very large array, parts of which I need to process in parallel for optimization. Each "part" of a row gets looped over and processed in a specific thread. Now, I've noticed that having only one "part", i.e. the whole array with a single worker thread running through it, is noticeably faster than dividing the array and processing it as separate sub-arrays with different threads.

    // Worker: processes the blocks in [t_row_start, t_row_end) of the current row.
    bool m_generate_row_worker(ull t_row_start, ull t_row_end)
    {
        for(; t_row_start < t_row_end; t_row_start++)
        {
            m_current_row[t_row_start] = m_singularity_checker(
                    m_previous_row[t_row_start], m_shared_random_row[t_row_start]);
        }
        return true;
    }

    ...
    //code
    ...
    // Launch all but the last worker; each covers m_parts_per_thread blocks.
    for(unsigned short thread_indx = 0; thread_indx < noThreads - 1; thread_indx++)
    {
        m_threads_array[thread_indx] = std::thread(
                m_generate_row_worker, this,
                thread_indx * m_parts_per_thread, (thread_indx + 1) * m_parts_per_thread);
    }
    // The last worker also takes the remainder blocks at the end of the row.
    m_threads_array[noThreads - 1] = std::thread(m_generate_row_worker, this,
            (noThreads - 1) * m_parts_per_thread,
            std::max(noThreads * m_parts_per_thread, m_blocks_per_row));
    //join
    for(unsigned short thread_indx = 0; thread_indx < noThreads; thread_indx++)
    {
        m_threads_array[thread_indx].join();
    }
EDIT:

    // Pure bitwise combination of the two words (no shared writable state).
    inline ull m_singularity_checker(ull t_to_be_ckecked_with, ull t_to_be_ckecked)
    {
        return (t_to_be_ckecked & (t_to_be_ckecked_with << 1)
                & (t_to_be_ckecked_with >> 1))
               | (t_to_be_ckecked_with & t_to_be_ckecked);
    }

Upvotes: 1

Views: 843

Answers (2)

eerorika

Reputation: 238351

Why does having more than one thread (parallel processing) in some specific cases degrade performance?

  • Because thread creation has overhead. If the task to be performed has only small computational cost, then the cost of creating multiple threads is more than the time saved by parallelism. This is especially the case when creating significantly more threads than there are CPU cores.
  • Because many algorithms do not easily divide into independent sub-tasks. Dependencies on other threads require synchronisation, which has overhead that can in some cases be more than the time saved by parallelism.
  • Because in poorly designed programs, synchronization can cause all tasks to be processed sequentially even if they are in separate threads.
  • Because (depending on CPU architecture) sometimes otherwise correctly implemented, seemingly independent tasks have an effective dependency because they operate on the same area of memory. More specifically, when a thread writes into a piece of memory, all threads operating on the same cache line must synchronise (the CPU does this for you automatically) to remain consistent. The cost of these cache misses is often much higher than the time saved by parallelism. This problem is called "false sharing"; see the sketch after this list.
  • Because sometimes introducing multi-threading makes the program more complex, which makes it more difficult for the compiler / optimiser to make use of instruction level parallelism.
  • ...
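
To make the false-sharing bullet concrete, here is a minimal sketch (not from the question's code; the counter names, the relaxed atomics, and the assumed 64-byte cache line size are all illustrative choices). Two threads increment adjacent counters that share a cache line, then the same workload runs with the counters padded onto separate lines; on typical hardware the padded version is considerably faster:

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <thread>

    // Both counters live on the same cache line: every write by one thread
    // forces the other core to re-fetch the line.
    struct SameLine
    {
        std::atomic<std::uint64_t> a{0};
        std::atomic<std::uint64_t> b{0};
    };

    // Each counter is aligned to an assumed 64-byte cache line,
    // so the two threads never touch the same line.
    struct SeparateLines
    {
        alignas(64) std::atomic<std::uint64_t> a{0};
        alignas(64) std::atomic<std::uint64_t> b{0};
    };

    template <class Counters>
    long long time_increments()
    {
        Counters c;
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] {
            for (int i = 0; i < 50'000'000; i++)
                c.a.fetch_add(1, std::memory_order_relaxed);
        });
        std::thread t2([&] {
            for (int i = 0; i < 50'000'000; i++)
                c.b.fetch_add(1, std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }

    int main()
    {
        std::cout << "same cache line: " << time_increments<SameLine>() << " ms\n";
        std::cout << "separate lines:  " << time_increments<SeparateLines>() << " ms\n";
    }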

In conclusion: Threads are not a silver bullet that automatically multiplies the performance of your program.


Regarding your program, we cannot rule out any of the above potential issues given the excerpt that you have shown.

Some tips on avoiding or finding the above issues:

  • Don't create more threads than you have cores, discounting the number of threads that are expected to be blocking (waiting for input, disk, etc).
  • Only use multi-threading with problems that are computationally expensive (or to do work while a thread is blocking, but this may be more efficiently solved using asynchronous I/O and coroutines).
  • Don't do (or do as little as possible) I/O from more than one thread into a single device (disk, NIC, virtual terminal, ...) unless it is specially designed to handle it.
  • Minimise the number of dependencies between threads. Consider all access to global things that may cause synchronisation, and avoid them. For example, avoid memory allocation. Keep in mind that things like operations on standard containers do memory allocation.
  • Keep the memory touched by distinct threads far from each other (not adjacent small elements of an array). If processing an array, divide it into consecutive blocks, rather than striping one element every (number of threads)th element; see the sketch after this list. In some extreme cases, extra copying into thread specific data structures, and then joining in the end, may be efficient.
  • If you've done all you can, and multi threading measures slower, consider whether perhaps it is not a good solution for your problem.
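
As a rough illustration of the first tip and the memory-layout tip, here is a sketch of splitting an array into one consecutive block per hardware thread (process_range and the doubling workload are placeholders, not the question's code):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Placeholder per-element work; stands in for whatever the real
    // row computation does.
    void process_range(std::vector<unsigned long long>& data,
                       std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; i++)
            data[i] *= 2;
    }

    void process_in_blocks(std::vector<unsigned long long>& data)
    {
        if (data.empty())
            return;

        // One thread per core, and never more threads than elements.
        std::size_t n_threads = std::max(1u, std::thread::hardware_concurrency());
        n_threads = std::min<std::size_t>(n_threads, data.size());

        // Consecutive blocks keep each thread's writes far from the others'.
        const std::size_t chunk = data.size() / n_threads;
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < n_threads; t++)
        {
            const std::size_t begin = t * chunk;
            // The last thread also covers the division remainder.
            const std::size_t end = (t + 1 == n_threads) ? data.size() : begin + chunk;
            workers.emplace_back(process_range, std::ref(data), begin, end);
        }
        for (auto& w : workers)
            w.join();
    }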

Upvotes: 3

code_fodder
code_fodder

Reputation: 16341

Using threads does not always mean that you will get more work done. For example, using 2 threads does not mean you will get a task done in half the time. There is an overhead to setting up threads and, depending on how many cores you have, the OS, etc., there can be a lot of context switching between threads (saving one thread's stack/registers and loading the next one; it all adds up). At some point adding more threads will start to slow your program down, since more time is spent switching between threads and setting them up and tearing them down than doing actual work. So you may be a victim of this.

If you have 100 very small items of work (like 1 instruction each), then 100 threads are guaranteed to be slower, since you now have ("many instructions" + 1) x 100 units of work to do, where the "many instructions" are the cost of setting up the threads, clearing them up at the end, and switching between them.

So, you may want to start profiling this for yourself: how much work is done processing each row, and how many threads in total are you setting up?

One very crude but quick/simple way to start measuring is to take the time elapsed to process one row in isolation (e.g. use the std::chrono functions to record the time at the start of processing one row and again at the end, to see the total time spent). Then maybe do the same test over the entire table to get an idea of the total time.
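
A minimal version of that measurement might look like this (process_one_row is a stand-in for your actual row-processing call, not a function from the question):

    #include <chrono>
    #include <iostream>

    // Placeholder for the real row-processing work.
    void process_one_row()
    {
        volatile unsigned long long sink = 0;
        for (unsigned long long i = 0; i < 1'000'000; i++)
            sink += i;
    }

    int main()
    {
        auto start = std::chrono::steady_clock::now();
        process_one_row();
        auto end = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << "one row took " << us.count() << " us\n";
    }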

If you find that an individual row takes very little time, then you may not be getting much benefit from the threads... You may be better off splitting the table into chunks of work equal to the number of cores your CPU has, then varying the number of threads (+/-) to find the sweet spot. Just creating threads based on the number of rows is a poor choice; you really want to design it to max out each core (for example).

So if you had 4 cores, maybe start by splitting the work into 4 threads. Then test with 8; if that is better, try 16; if it is worse, try 12... etc. A small sweep like the sketch below makes this systematic.
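
For example, such a sweep could be written like this (run_with_threads is a hypothetical stand-in that splits a fixed amount of dummy work across n threads; substitute your real row processing):

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Hypothetical workload: divides a fixed amount of dummy work
    // across n_threads worker threads.
    void run_with_threads(unsigned n_threads)
    {
        const unsigned long long total = 100'000'000;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < n_threads; t++)
        {
            workers.emplace_back([=] {
                volatile unsigned long long sink = 0;
                for (unsigned long long i = 0; i < total / n_threads; i++)
                    sink += i;
            });
        }
        for (auto& w : workers)
            w.join();
    }

    int main()
    {
        // Doubling sweep: 1, 2, 4, 8, 16 threads; refine around the best result.
        for (unsigned n = 1; n <= 16; n *= 2)
        {
            auto start = std::chrono::steady_clock::now();
            run_with_threads(n);
            auto end = std::chrono::steady_clock::now();
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
            std::cout << n << " threads: " << ms.count() << " ms\n";
        }
    }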

Also you might get different results on different PCs...

Upvotes: 1
