user85877

Reputation:

Multiprocessor and Performance

I'm facing a really strange problem with a .NET service.

I developed a multithreaded x64 Windows service.

I tested this service on an x64 server with 8 cores. The performance was great!

Now I have moved the service to a production server (x64, 32 cores). During testing I found that performance is at least 10 times worse than on the test server.

I've checked loads of performance counters trying to find a reason for the poor performance, but I couldn't pinpoint a cause.

Could it be a GC problem? Have you ever faced a problem like this?

Thank you in advance! Alexandre

Upvotes: 2

Views: 1151

Answers (7)

George V. Reilly

Reputation: 16313

I agree with Blank, it's likely to be some form of contention. It's likely to be very hard to track down, unfortunately. It could be in your application code, the framework, the OS, or some combination thereof. Your application code is the most likely culprit, since Microsoft has expended significant effort on making the CLR and the OS scale on 32P boxes.

The contention could be in some hot locks, but it could also be that some processor cache lines are sloshing back and forth between CPUs.

What's your metric for 10x worse? Throughput?

Have you tried booting the 32-proc box with fewer CPUs? Use the /NUMPROC option in boot.ini or BCDedit.

Do you achieve 100% CPU utilization? What's your context switch rate like? And how does this compare to the 8P box?
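As an illustration of keeping hot locks and shared cache lines out of the inner loop, here is a minimal sketch. It is written in Java rather than C# purely for illustration, and the class name `LocalThenMerge` and the sum-of-squares workload are hypothetical stand-ins for the service's real work. Each thread accumulates into a local variable and touches the shared total only once, so the cores are not constantly trading ownership of the same cache line.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LocalThenMerge {
    // Each worker accumulates into a thread-local variable (no shared writes
    // in the hot loop) and publishes its result to the shared total once.
    public static long sumSquares(int threads, int perThread) {
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                long local = 0;
                for (int j = 1; j <= perThread; j++) {
                    local += (long) j * j;   // hypothetical workload
                }
                total.addAndGet(local);      // one shared write per thread
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(sumSquares(4, 10));
    }
}
```

In .NET the same shape can be had with a local variable per thread and a single `Interlocked.Add` at the end; whether it helps depends on where the contention really is.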

Upvotes: 0

jgauffin

Reputation: 101130

How many threads are you using? Using too many thread pool threads can cause thread starvation, which would make your program slower.
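One way to keep the thread count under control is to run the work on a pool with an explicit cap. The following is only a sketch, in Java rather than C#, and `BoundedPool`, `runAll`, and the trivial doubling task are hypothetical names chosen for the example. Tying the cap to the core count is an assumption that suits CPU-bound work; I/O-bound work may need a different number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedPool {
    // Runs `tasks` small jobs on a pool capped at `poolSize` threads
    // and returns the sum of their results.
    public static long runAll(int tasks, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final long n = i;
                Callable<Long> job = () -> n * 2;   // hypothetical unit of work
                futures.add(pool.submit(job));
            }
            long sum = 0;
            for (Future<Long> f : futures) {
                sum += f.get();                     // wait for each job to finish
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumption: one worker per core is a sensible cap for CPU-bound jobs.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(runAll(100, cores));
    }
}
```

In .NET, `Environment.ProcessorCount` gives the equivalent core count to size against.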

Some articles: http://www2.sys-con.com/ITSG/virtualcd/Dotnet/archives/0112/gomez/index.html http://codesith.blogspot.com/2007/03/thread-starvation-in-shared-thread-pool.html

(search for thread starvation in them)

You could use a .NET profiler to find your bottlenecks; here is a good free one: http://www.eqatec.com/tools/profiler

Upvotes: 0

Adam Jaskiewicz

Reputation: 10996

With that many threads running concurrently, you're going to have to be really careful to get around issues of threads fighting with each other to access your data. Read up on Non-blocking synchronization.
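To give a feel for what non-blocking synchronization looks like, here is a minimal compare-and-set retry loop. It is a sketch in Java rather than C#, and `CasCounter` and `runDemo` are hypothetical names. No thread ever holds a lock: a thread that loses the race simply re-reads the value and tries again.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free increment: retry the compare-and-set until no other
    // thread has changed the value between our read and our write.
    public long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public long get() {
        return value.get();
    }

    // Starts `threads` workers that each increment `perThread` times,
    // waits for them, and returns the final count.
    public static long runDemo(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runDemo(8, 100_000));
    }
}
```

The .NET counterpart of `compareAndSet` is `Interlocked.CompareExchange`. Note that a single heavily contended atomic can still become a bottleneck on 32 cores, since every retry moves the same cache line around.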

Upvotes: 0

user82238

Reputation:

This is a common problem which people are generally unaware of, because very few people have experience on many-CPU machines.

The basic problem is contention.

As the CPU count increases, contention increases in all shared data structures. For low CPU counts, contention is low and the fact you have multiple CPUs improves performance. As the CPU count becomes significantly larger, contention begins to drown out your performance improvements; as the CPU count becomes large, contention actually starts reducing performance below that of a lower number of CPUs.
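One common way to put numbers on this curve is Gunther's Universal Scalability Law. It is not part of the answer above and is offered only as a rough model; the symbols are assumptions of the model: $N$ is the CPU count, $\sigma$ the fraction of work serialized by contention, and $\kappa$ the cost of keeping shared data coherent between CPUs.

```latex
% Relative throughput on N CPUs (Universal Scalability Law)
C(N) = \frac{N}{1 + \sigma\,(N - 1) + \kappa\,N\,(N - 1)}

% Throughput peaks near N^* and falls beyond it when \kappa > 0
N^{*} = \sqrt{\frac{1 - \sigma}{\kappa}}
```

With $\kappa = 0$ throughput merely levels off as CPUs are added; with $\kappa > 0$ it peaks and then declines, which matches the case where 32 cores end up slower than 8.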

You are basically facing one of the aspects of the scalability problem.

I'm not sure, however, where this problem lies: in your data structures, or in the operating system's data structures. The former you can address; lock-free data structures are an excellent, highly scalable approach. The latter is difficult, since it essentially requires avoiding certain OS functionality.

Upvotes: 9

Marc Gravell

Reputation: 1062540

There are so many factors here:

  • are you actually using the cores?
  • are your extra threads causing locking issues to be more obvious?
  • do you not have enough memory to support all the extra stacks / data you can process?
  • can your IO (disk/network/database) stack keep up with the throughput?

etc

Upvotes: 1

Alan Jackson

Reputation: 6511

There are way too many variables to know why one machine is slower than the other. 32-core machines are usually more specialized, whereas an eight-core box could just be a dual-processor, quad-core machine. Are there VMs or other things running at the same time? Usually with that many cores, IO bandwidth becomes the limiting factor (even if the CPUs still have plenty of headroom).

To start off, you should probably add lots of timers in your code (or profiling or whatever) to figure out what part of your code is taking up the most time.
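A minimal sketch of the "add lots of timers" idea follows. It is written in Java rather than C#, and `SectionTimer` and the section names `"parse"` and `"store"` are hypothetical. Each named section accumulates its elapsed nanoseconds, so after a test run you can see which part dominates.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class SectionTimer {
    // Total elapsed nanoseconds per named section; safe to update from many threads.
    private final ConcurrentHashMap<String, LongAdder> totals = new ConcurrentHashMap<>();

    // Runs `work` and adds its wall-clock duration to the section's total.
    public void time(String section, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totals.computeIfAbsent(section, k -> new LongAdder())
                  .add(System.nanoTime() - start);
        }
    }

    // Accumulated nanoseconds for a section, or 0 if it was never timed.
    public long nanos(String section) {
        LongAdder adder = totals.get(section);
        return adder == null ? 0L : adder.sum();
    }

    // Prints one line per section with its total time in milliseconds.
    public void report() {
        totals.forEach((name, adder) ->
            System.out.printf("%-12s %d ms%n", name, adder.sum() / 1_000_000));
    }

    public static void main(String[] args) {
        SectionTimer timer = new SectionTimer();
        timer.time("parse", () -> { /* hypothetical parsing step */ });
        timer.time("store", () -> { /* hypothetical database write */ });
        timer.report();
    }
}
```

In .NET, `System.Diagnostics.Stopwatch` plays the role of `System.nanoTime()`. Keep in mind that timing very short sections from many threads adds overhead of its own, so a profiler may give a cleaner picture.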

Performance troubleshooting 101: what is the bottleneck (where in the code, and in which subsystem: memory, disk, or CPU)?

Upvotes: 2

dommer

Reputation: 19810

Could it be down to differences in memory or the disk? If those were the bottleneck, you wouldn't get any value from the additional processing power. I can't really tell without more details of your application and configuration.

Upvotes: 0
