Alex Popov

Reputation: 2519

Multi-threaded C program much slower in OS X than Linux

I wrote this for an OS class assignment that I've already completed and handed in. I posted this question yesterday, but due to "Academic Honesty" regulations I took it off until after the submission deadline.

The object was to learn how to use critical sections. There is a data array with 100 monotonically increasing numbers, 0...99, and 40 threads that randomly swap two elements 2,000,000 times each. Once a second a Checker thread goes through and makes sure that there is only one of each number (which means that no unprotected parallel access happened).

Here were the Linux times:

real    0m5.102s
user    0m5.087s
sys     0m0.000s

and the OS X times

real    6m54.139s
user    0m41.873s
sys     6m43.792s

I run the Linux times in a Vagrant box with ubuntu/trusty64 on the same machine that runs OS X: a quad-core i7 2.3 GHz (up to 3.2 GHz) 2012 rMBP.

If I understand correctly, sys is system overhead, which I have no control over, and even then, 41s of user time suggests that perhaps the threads are running serially.

I can post all the code if needed, but I will post the bits I think are relevant. I am using pthreads since that's what Linux provides, but I assumed they work on OS X.

Creating swapper threads to run swapManyTimes routine:

for (int i = 0; i < NUM_THREADS; i++) {
    int err = pthread_create(&(threads[i]), NULL, swapManyTimes, NULL);
}

Swapper thread critical section, run in a for loop 2 million times:

pthread_mutex_lock(&mutex);    // begin critical section
int tmpFirst = data[first];
data[first] = data[second];
data[second] = tmpFirst;
pthread_mutex_unlock(&mutex);  // end critical section
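
For context, the surrounding swapManyTimes routine looks roughly like this (the rand()-based index selection shown here is illustrative, not the exact code I submitted):

void *swapManyTimes(void *arg) {
    for (int i = 0; i < 2000000; i++) {
        int first  = rand() % DATA_SIZE;   /* illustrative: pick two random indices */
        int second = rand() % DATA_SIZE;

        pthread_mutex_lock(&mutex);        // begin critical section
        int tmpFirst = data[first];
        data[first] = data[second];
        data[second] = tmpFirst;
        pthread_mutex_unlock(&mutex);      // end critical section
    }
    return NULL;
}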

Only one Checker thread is created, the same way as the Swappers. It operates by going over the data array and marking the index corresponding to each value with true. Afterwards, it checks how many indices were never marked, as follows:

pthread_mutex_lock(&mutex);
for (int i = 0; i < DATA_SIZE; i++) {
    int value = data[i];
    consistency[value] = 1;
}
pthread_mutex_unlock(&mutex); 

It runs once a second by calling sleep(1) after each pass through its while(1) loop. After all swapper threads are joined, this thread is cancelled and joined as well.
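Roughly, the checker loop and its shutdown look like this (the missing-value report and the checkerThread name are illustrative, not my exact code):

void *checkerRoutine(void *arg) {
    while (1) {
        for (int i = 0; i < DATA_SIZE; i++)   /* clear marks from the last pass */
            consistency[i] = 0;

        pthread_mutex_lock(&mutex);
        for (int i = 0; i < DATA_SIZE; i++)
            consistency[data[i]] = 1;          /* mark each value seen */
        pthread_mutex_unlock(&mutex);

        for (int i = 0; i < DATA_SIZE; i++)
            if (!consistency[i])
                fprintf(stderr, "value %d missing -- concurrent corruption!\n", i);

        sleep(1);
    }
    return NULL;
}

/* after joining all swapper threads: */
pthread_cancel(checkerThread);
pthread_join(checkerThread, NULL);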

I would be happy to provide any more information that can help figure out why this sucks so much on Mac. I'm not really looking for help with code optimization, unless that's what's tripping up OS X. I've tried building it using both clang and gcc-4.9 on OS X.

Upvotes: 14

Views: 3323

Answers (4)

Buzz Moschetti

Reputation: 7568

Just came across this question. It's been 8 years, and MacBooks now use Apple Silicon M2 chips. I compiled the program posted in jschultz410's answer with cc -O2 thread1.cc -o thread1.

The machine is a 2023 MBP with an Apple M2 Max chip: 12 cores, 3.49 GHz, no hyperthreading, 64 GB RAM, macOS Ventura 13.2.

The compiler is:

$ cc --version
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.3.0
Thread model: posix

Here are the results:

$ time  ./thread1 10
Thread exiting with avg streak length 43.901835
Thread exiting with avg streak length 46.534646
Thread exiting with avg streak length 42.759162
Thread exiting with avg streak length 49.811208
Thread exiting with avg streak length 45.170582
Thread exiting with avg streak length 40.739703
Thread exiting with avg streak length 43.436409
Thread exiting with avg streak length 41.932822
Thread exiting with avg streak length 46.498628
Thread exiting with avg streak length 49.501871

real    0m0.301s
user    0m0.244s
sys     0m2.476s

The old OS X result (the 10-thread run from jschultz410's answer) was

real    0m34.163s
user    0m5.902s
sys     0m30.329s

so basically 100X faster. Here's the 40 thread run:

$ time  ./thread1 40
Thread exiting with avg streak length 40.080895
Thread exiting with avg streak length 40.563484
Thread exiting with avg streak length 41.033526
Thread exiting with avg streak length 40.101704
Thread exiting with avg streak length 40.923921
Thread exiting with avg streak length 39.206587
Thread exiting with avg streak length 41.269448
Thread exiting with avg streak length 40.074901
Thread exiting with avg streak length 39.983886
Thread exiting with avg streak length 39.298998
Thread exiting with avg streak length 41.183559
Thread exiting with avg streak length 40.596663
Thread exiting with avg streak length 40.342142
Thread exiting with avg streak length 40.017688
Thread exiting with avg streak length 40.077315
Thread exiting with avg streak length 42.075402
Thread exiting with avg streak length 40.006241
Thread exiting with avg streak length 39.011589
Thread exiting with avg streak length 39.169611
Thread exiting with avg streak length 40.141985
Thread exiting with avg streak length 38.546334
Thread exiting with avg streak length 40.282509
Thread exiting with avg streak length 39.643092
Thread exiting with avg streak length 39.955526
Thread exiting with avg streak length 40.576165
Thread exiting with avg streak length 40.821767
Thread exiting with avg streak length 40.840747
Thread exiting with avg streak length 40.670436
Thread exiting with avg streak length 40.204157
Thread exiting with avg streak length 38.819978
Thread exiting with avg streak length 40.324757
Thread exiting with avg streak length 39.573690
Thread exiting with avg streak length 40.386526
Thread exiting with avg streak length 39.140447
Thread exiting with avg streak length 38.662529
Thread exiting with avg streak length 40.751783
Thread exiting with avg streak length 37.928655
Thread exiting with avg streak length 38.996100
Thread exiting with avg streak length 41.535742
Thread exiting with avg streak length 39.928289

real    0m1.222s
user    0m1.019s
sys     0m12.889s

Very fast! I then tried it on a new AWS Linux c5.4xlarge instance:

16 procs like this:
model name      : Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
stepping        : 7
microcode       : 0x5003501
cpu MHz         : 3621.221

32GB RAM

Compiled with:

$ cc --version
cc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)

$ cc -O2 thread1.cc -lpthread -o thread1

yields this result:

$ time ./thread1 10
Thread exiting with avg streak length 2.319996
Thread exiting with avg streak length 2.286782
Thread exiting with avg streak length 2.281775
Thread exiting with avg streak length 2.316133
Thread exiting with avg streak length 2.266562
Thread exiting with avg streak length 2.268253
Thread exiting with avg streak length 2.270898
Thread exiting with avg streak length 2.243067
Thread exiting with avg streak length 2.249166
Thread exiting with avg streak length 2.221897

real    0m4.604s
user    0m11.333s
sys     0m31.416s

So the new Linux/16 proc machine is slower than the "old" Linux. Interesting. Thanks to @jschultz410 for posting such a compact and highly portable piece of C code.

Upvotes: 0

jschultz410

Reputation: 2899

I've duplicated your result to a goodly extent (without the sweeper):

#include <stdlib.h>
#include <stdio.h>

#include <pthread.h>

pthread_mutex_t Lock;
pthread_t       LastThread;
int             Array[100];

void *foo(void *arg)
{
  pthread_t self  = pthread_self();
  int num_in_row  = 1;
  int num_streaks = 0;
  double avg_strk = 0.0;
  int i;

  for (i = 0; i < 1000000; ++i)
  {
    /* pick two indices in [0, 99]; dividing by RAND_MAX + 1.0 keeps the
       result strictly below 100 even when rand() returns RAND_MAX */
    int p1 = (int) (100.0 * rand() / ((double) RAND_MAX + 1.0));
    int p2 = (int) (100.0 * rand() / ((double) RAND_MAX + 1.0));

    pthread_mutex_lock(&Lock);
    {
      int tmp   = Array[p1];
      Array[p1] = Array[p2];
      Array[p2] = tmp;

      /* count how many times in a row the same thread re-acquires the lock */
      if (pthread_equal(LastThread, self))
        ++num_in_row;

      else
      {
        ++num_streaks;
        avg_strk += (num_in_row - avg_strk) / num_streaks;
        num_in_row = 1;
        LastThread = self;
      }
    }
    pthread_mutex_unlock(&Lock);
  }

  fprintf(stdout, "Thread exiting with avg streak length %lf\n", avg_strk);

  return NULL;
}

int main(int argc, char **argv)
{
  int       num_threads = (argc > 1 ? atoi(argv[1]) : 40);
  pthread_t thrs[num_threads];
  void     *ret;
  int       i;

  if (pthread_mutex_init(&Lock, NULL))
  {
    perror("pthread_mutex_init failed!");
    return 1;
  }

  for (i = 0; i < 100; ++i)
    Array[i] = i;

  for (i = 0; i < num_threads; ++i)
    if (pthread_create(&thrs[i], NULL, foo, NULL))
    {
      perror("pthread create failed!");
      return 1;
    }

  for (i = 0; i < num_threads; ++i)
    if (pthread_join(thrs[i], &ret))
    {
      perror("pthread join failed!");
      return 1;
    }

  /*
  for (i = 0; i < 100; ++i)
    printf("%d\n", Array[i]);

  printf("Goodbye!\n");
  */

  return 0;
}
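
For anyone reproducing the runs below: it builds as plain C with pthreads, e.g. (the file name is just whatever you saved it as):

$ gcc -O2 thread1.c -lpthread -o a.out
$ time ./a.out 40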

On a Linux (2.6.18-308.24.1.el5) server Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz

[ltn@svg-dc60-t1 ~]$ time ./a.out 1

real    0m0.068s
user    0m0.068s
sys 0m0.001s
[ltn@svg-dc60-t1 ~]$ time ./a.out 2

real    0m0.378s
user    0m0.443s
sys 0m0.135s
[ltn@svg-dc60-t1 ~]$ time ./a.out 3

real    0m0.899s
user    0m0.956s
sys 0m0.941s
[ltn@svg-dc60-t1 ~]$ time ./a.out 4

real    0m1.472s
user    0m1.472s
sys 0m2.686s
[ltn@svg-dc60-t1 ~]$ time ./a.out 5

real    0m1.720s
user    0m1.660s
sys 0m4.591s

[ltn@svg-dc60-t1 ~]$ time ./a.out 40

real    0m11.245s
user    0m13.716s
sys 1m14.896s

On my MacBook Pro (Yosemite 10.10.2) 2.6 GHz i7, 16 GB memory

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 1

real    0m0.057s
user    0m0.054s
sys 0m0.002s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 2

real    0m5.684s
user    0m1.148s
sys 0m5.353s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 3

real    0m8.946s
user    0m1.967s
sys 0m8.034s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 4

real    0m11.980s
user    0m2.274s
sys 0m10.801s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 5

real    0m15.680s
user    0m3.307s
sys 0m14.158s
john-schultzs-macbook-pro:~ jschultz$ time ./a.out 40

real    2m7.377s
user    0m23.926s
sys 2m2.434s

It took my Mac ~12x as much wall-clock time to complete with 40 threads, and that's versus a very old version of Linux + gcc.

NOTE: I changed my code to do 1M swaps per thread.

It looks like, under contention, OS X is doing a lot more work than Linux. Maybe it is interleaving the threads much more finely than Linux does?

EDIT: Updated the code above to record the average number of times a thread re-captures the lock immediately (the "streak length" printed below):

Linux

[ltn@svg-dc60-t1 ~]$ time ./a.out 10
Thread exiting with avg streak length 2.103567
Thread exiting with avg streak length 2.156641
Thread exiting with avg streak length 2.101194
Thread exiting with avg streak length 2.068383
Thread exiting with avg streak length 2.110132
Thread exiting with avg streak length 2.046878
Thread exiting with avg streak length 2.087338
Thread exiting with avg streak length 2.049701
Thread exiting with avg streak length 2.041052
Thread exiting with avg streak length 2.048456

real    0m2.837s
user    0m3.012s
sys 0m16.040s

Mac OSX

john-schultzs-macbook-pro:~ jschultz$ time ./a.out 10
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000
Thread exiting with avg streak length 1.000000

real    0m34.163s
user    0m5.902s
sys 0m30.329s

So, OS X is handing the lock around much more evenly and therefore incurs many more thread suspensions and resumptions.

Upvotes: 6

user3629249

Reputation: 16540

The OP does not mention/show any code indicating that the threads sleep, wait, or otherwise give up execution, and all the threads run at the same 'nice' level.

So an individual thread may well get the CPU and not release it until it has completed all 2 million swaps.

On Linux, this results in a minimal amount of time spent performing context switches.

On Mac OS, however, a thread is only given a time slice to execute before another ready-to-run thread/process is allowed to execute.

This means many, many more context switches.

Context switches are accounted for in 'sys' time.

The result is that Mac OS takes much longer to execute.

To level the playing field, you could force context switches on Linux as well, by inserting a nanosleep() call or by explicitly yielding the CPU via

#include <sched.h>

and then calling

int sched_yield(void);
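
For instance, a sketch of the swap loop with an explicit yield after releasing the lock (whether this actually narrows the OS X/Linux gap is untested here):

#include <sched.h>

/* inside the swapper's loop */
pthread_mutex_lock(&mutex);
int tmp = data[first];
data[first] = data[second];
data[second] = tmp;
pthread_mutex_unlock(&mutex);

sched_yield();   /* voluntarily give the CPU to another ready thread */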

Upvotes: 1

Log_n

Reputation: 404

Mac OS X and Linux implement pthreads differently, which causes this slow behavior. Specifically, Mac OS X does not use spinlocks (they are an optional feature of the POSIX standard), and that can lead to very, very slow performance with code like this example.
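
To illustrate what a spinlock is, here is a minimal C11 sketch of a test-and-set lock that busy-waits instead of putting the thread to sleep (purely illustrative; this is not how either OS implements pthread_mutex_lock, and spinning wastes CPU when the lock is held for long):

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* keep retrying until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  /* spin: no syscall, no context switch */
}

void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}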

Upvotes: 7
