Reputation: 30755
I have an application with 4 threads, which I'm running on an 8-core Intel Process with CentOS. When I run two instances of it, the two instances combined run faster than 1 instance. Moreover, for two instances, they run faster when I do not bind each thread to a different core and left scheduling to the scheduler.
Going with that logic, perhaps, one instance is also running slow because scheduler had pinned all the 4 threads to 4 separate cores. When I pin each thread of 2 instances, that is, 8 threads pinned to 8 cores, the two instances combined also take approximately the same time as the one instance.
The application is lock intensive, as there are frequent shared memory acesses. Also note that the CPU overhead remains 0% otherwise, so there are almost no other processes consuming cycles.
Now my question is what is going on here? One explanation that comes to my mind is that when threads are pinned to different cores, they try to acquire the mutexes almost at the same time, and as a result, high contention occurs, but when they are scheduled by the scheduler, they might try to acquire mutexes at slightly different times, thus decreasing contention.
Any opinions! This is really weird, because left to the scheduler, one instance runs slower than two instances combined. How do I compile my benchmarking results. No on would believe me!
Processor Info as given by /proc/cpuinfo is the following.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 0
cpu cores : 4
apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.15
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5319.96
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 1
cpu cores : 4
apicid : 18
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.04
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 2
cpu cores : 4
apicid : 20
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.04
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 22
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.03
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.07
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 8
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 0
cpu cores : 4
apicid : 17
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.01
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 9
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.07
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 10
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 1
cpu cores : 4
apicid : 19
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.04
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 12
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 2
cpu cores : 4
apicid : 21
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.03
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 13
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 5
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 14
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 3
cpu cores : 4
apicid : 23
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.02
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2660.076
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 5320.04
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: [8]
Upvotes: 4
Views: 310
Reputation: 21926
You're likely suffering from a combination of three effects:
Your pinning is actually tying the scheduler's hands. You said your program is lock heavy. When a thread fails to acquire a lock, it goes to sleep, freeing up the CPU to handle a different thread. But if you've pinned your threads, then the scheduler can't move any of the threads that have work to do to the CPU that's free! So your thread pinning can actually decrease performance.
False sharing. When you run multiple instance so that each is their own process, the OS is going to give each process different physical memory pages (unlike two threads in the same process). This means that the processes can't accidentally share cache lines. Make sure any global data being accessed from multiple threads is either read only or on its own cache line (See posix_memalign).
If you have a lot of locks, your program may just be too single threaded. Anytime you use locks to prevent two threads from acting at the same time you're essentially saying, "make this part of my program single threaded." Too many instances of that and you end up with a single threaded program with the added overhead of locks. Multiple instances mean no lock contention, so you get closer to your original single threaded performance.
Upvotes: 0
Reputation: 37214
One explanation is cache and memory architecture; and data locality.
According to Intel's spec finder, for your X5550 CPU there's 4 cores (and 8 logical CPUs) per physical chip. If you've got 16 logical CPUs then you've got 2 physical chips (which is entirely possible for this Xeon). More research reveals that each core has it's own L1 and L2 caches that are shared by both logical CPUs in that core; and each physical chip has 8 MiB of L3 cache that is shared by all 4 cores (8 logical CPUs) on that chip.
If logical CPUs 0 and 1 share an L2 cache and logical CPUs 2 and 3 share a separate L2 cache, then 2 threads modifying the same memory may run better on logical CPUs 0 and 1 (rather than logical CPUs 0 and 2) because the data remains in the "exclusive" state on one cache (and isn't bouncing between different L2 caches).
In the same way, if the first eight logical CPUs share an L3 cache and the other group of eight logical CPUs share an L3 cache, then 8 threads modifying the same memory may run better when restricted to one group of logical CPUs (and not spread across separate physical chips with separate L3 caches).
The other thing is that with a pair of "Xeon X550" CPUs it's extremely likely that your computer is ccNUMA. Two separate memory banks, where one memory bank is directly connected to the first physical CPU and the other memory bank directly connected to the second physical CPU. In this case, when a CPU tries to access RAM that isn't directly connected to that CPU, it needs to ask the other CPU (where the RAM is connected) to fetch it (rather than being able to fetch it itself), and this has a performance penalty (roughly "15% slower").
Linux does have special support ccNUMA systems. It will try to limit each process (and all its threads) to a specific group of CPUs and allocate memory that is directly connected to those CPUs; in an attempt to avoid that "roughly 15% slower" performance penalty.
If you combine all of this, a process with 8 threads that is CPU bound and is continually fighting for the same locks (modifying cache lines where those locks are) will probably run better on the first 8 logical CPUs or the last 8 logical CPUs; and there will be a relatively high performance problem (maybe as bad as "30% slower" once you combined the NUMA penalty and the cache bounce problem) if you tried to spread the process' threads evenly across logical CPUs (e.g. CPUs 0, 2, 4, 6, 8, 10, etc).
Upvotes: 3
Reputation: 94175
Is your CPU Hyper-Threading enabled?
Try to bind processes to even-numbered cores: CPU0, CPU2, CPU6, CPU8 as it was suggested by Agner fox (there is a citation in Linux find out Hyper-threaded core id )
If you binded 4 threads to CPU0- CPU3, you will bind them to 2 physical cores and there will be contention for resources (HT-thread is not as fast physical thread IF there is another HT-thread on the same CPU).
In linux, HT cores are numbered in pairs: so CPU0-CPU1 are two HT-halves of first physical core, CPU2-CPU3 - of second and so on.
If you get a slowdown from using only even-numbered CPUs, I conclude you have a lot of syncronization between threads. HT-threads can be synchronized fasted than threads of different physical cores.
Upvotes: 3
Reputation: 3638
One possible explanation is that they can use the same core, (one goes to sleep, the other wakes up), so can share the same cache, which reduces cache misses, and therefore improves performance rather significantly on that alone. A CPU spending 100% of its quantum will not hog the CPU completely. Bringing another thread in will make the other thread run while the first is sleeping on whatever action.
Upvotes: 0