Reputation: 2051
I'm parallelizing the following block in order to compute the number of lethal accidents per week over a variable-size dataset (0.5M, 1M, 2M rows, etc.) stored inside localRows, which is a vector. The shared variables local_lethAccPerWeek, local_accAndPerc and local_boroughWeekAcc are arrays stored contiguously (e.g. int local_lethAccPerWeek[NUM_YEARS][NUM_WEEKS_PER_YEAR] = {};).
// [2] Data processing
procBegin = MPI_Wtime();
cout << "[Proc. " + to_string(myrank) + "] Started processing dataset..." << endl;
omp_set_num_threads(num_omp_threads);
int cfIndex, brghIndex;
// Every worker will compute in the final datastructure the num of lethal accidents for its sub-dataset and then Reduce it to allow the master to collect final results
#pragma omp parallel for default(shared) schedule(dynamic) private(cfIndex, brghIndex)
for (int i = 0; i < my_num_rows; i++)
{
int lethal = (localRows[i].num_pers_killed > 0) ? 1 : 0;
string borough = string(localRows[i].borough);
int week, year, month = 0;
if (lethal || !borough.empty())
{
week = getWeek(localRows[i].date);
year = getYear(localRows[i].date);
month = getMonth(localRows[i].date);
// If I'm week = 1 and month = 12, this means I belong to the first week of the next year.
// If I'm week = (52 or 53) and month = 01, this means I belong to the last week of the previous year.
if (week == 1 && month == 12)
year++;
else if ((week == 52 || week == 53) && month == 1)
year--;
year = year - 2012;
week = week - 1;
}
/* Query1 */
if (lethal)
#pragma omp atomic
local_lethAccPerWeek[year][week]++;
/* Query2 */
for (int k = 0; k < localRows[i].num_contributing_factors; k++)
{
cfIndex = cfDictionary.at(string(localRows[i].contributing_factors[k]));
#pragma omp critical(query2)
{
(local_accAndPerc[cfIndex].numAccidents)++;
(local_accAndPerc[cfIndex].numLethalAccidents) += lethal;
}
}
/* Query3 */
if (!borough.empty()) // if borough is not specified we're not interested
{
brghIndex = brghDictionary.at(borough);
#pragma omp critical(query3)
{
local_boroughWeekAcc[brghIndex][year][week].numAccidents++;
local_boroughWeekAcc[brghIndex][year][week].numLethalAccidents += lethal;
}
}
}
procDuration = MPI_Wtime() - procBegin;
I'm experiencing a strange behavior: increasing the number of OpenMP threads gives me higher execution times. I'm aware that spawning threads increases overhead due to context switches etc., and that in some cases it can be faster to just let one thread do the job, but I don't see how parallelizing an operation of this kind (which is just an increment in the atomic section) could make things worse. Out of curiosity I've also tried changing the scheduling, but of course that doesn't help.
I'm asking you since you may see something that I'm missing. Thanks in advance, and please comment if you need further info.
Upvotes: 0
Views: 564
Reputation: 74355
Several notes here:
You use schedule(dynamic). This means that every single iteration of the loop gets dispatched to a different thread on a first-come, first-served basis, which adds a lot of overhead, especially if my_num_rows is large. It is better to work with chunks of iterations, say N iterations each, so try changing the schedule clause to schedule(dynamic,N).
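For illustration, a minimal sketch of what that could look like applied to your directive (the chunk size of 1024 is an arbitrary starting point, not a recommendation; tune it for your dataset):

// Each thread now grabs 1024 iterations at a time instead of one,
// which greatly reduces the scheduling overhead.
#pragma omp parallel for default(shared) schedule(dynamic, 1024) private(cfIndex, brghIndex)
for (int i = 0; i < my_num_rows; i++)
{
    // ... same loop body as above ...
}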
You have a lot of instances of true and false sharing that nullify the benefits of the CPU caches, for the following two reasons.
Atomic updates of shared variables are way slower in parallel than when done by a single thread, because the L1/L2 cache line holding the value gets constantly invalidated and reloaded from further down the cache hierarchy. In a sequential program, the cache lines stay hot, and the compiler can even apply register optimisations if it is a single value (the latter not applicable to your case since you are incrementing array elements).
Similar to the previous point, false sharing happens when you update two distinct array elements that happen to fall in the same cache line. That seems likely in Q2, for example: if the elements of local_accAndPerc are a pair of ints, eight consecutive counters share one 64-byte cache line, so threads updating different contributing factors still invalidate each other's caches, especially when the number of distinct factors is low.
What you can do is sort and group your localRows by borough and date, then spread the computation over the groups. That will prevent the true and false sharing issues while updating the aggregates in Q1 and Q3. As for the contributing factors in Q2, if there aren't that many of them, use an OpenMP reduction, as sketched below.
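A minimal sketch of what the Q2 part could look like with a reduction, assuming your compiler supports OpenMP 4.5 array-section reductions; NUM_FACTORS is a hypothetical compile-time constant bounding the number of entries in cfDictionary:

// Hypothetical bound on the number of distinct contributing factors
int q2Acc[NUM_FACTORS] = {};
int q2Lethal[NUM_FACTORS] = {};
#pragma omp parallel for schedule(dynamic, 1024) reduction(+ : q2Acc[:NUM_FACTORS], q2Lethal[:NUM_FACTORS])
for (int i = 0; i < my_num_rows; i++)
{
    int lethal = (localRows[i].num_pers_killed > 0) ? 1 : 0;
    for (int k = 0; k < localRows[i].num_contributing_factors; k++)
    {
        int cfIndex = cfDictionary.at(string(localRows[i].contributing_factors[k]));
        q2Acc[cfIndex]++;            // each thread increments its own private copy, no atomics
        q2Lethal[cfIndex] += lethal;
    }
}
// Merge the reduced totals into the shared structure once, after the parallel loop
for (int c = 0; c < NUM_FACTORS; c++)
{
    local_accAndPerc[c].numAccidents += q2Acc[c];
    local_accAndPerc[c].numLethalAccidents += q2Lethal[c];
}

In practice you would fold this back into your existing loop next to Q1 and Q3; the point is that each thread accumulates into private copies and the shared arrays are only touched once, after the loop.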
Upvotes: 1