Usman Mutawakil

Reputation: 5259

How is processor speed distributed across threads?

Objective: I am trying to estimate how fast my code will execute when run concurrently in multiple threads.

Question 1)

If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?

Question 2)

What impact, if any, does the presence of other threads have on the execution speed of each other thread?

My Situation:

I traverse a graph in memory with a worst-case size of 1 million nodes. It's simply accessing 1 million memory addresses, one at a time. This takes half a second on one thread, and I was worried about how it will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread shares the same resource, but read-only; there is no writing. Is there any chance I could get each user to see roughly the same execution time?

Note: I know it will depend upon a number of factors, but surely there must be some way of identifying whether or not your code will scale if you know it takes x amount of time for a single thread on given hardware. As a final note, I'd like to add that I have limited experience with computer hardware architecture and how multi-threading works under the hood.

Upvotes: 1

Views: 324

Answers (2)

Brian Cain

Reputation: 14619

If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?

No, you should determine it empirically.
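
Measuring it empirically can be as simple as running the same task on an increasing number of threads and comparing the mean per-task time. A minimal sketch in Java (the `task` method is a stand-in for your real workload, such as the graph traversal from the question):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Empirical scaling probe: run the same read-only task on 1, 2, 4, ...
// threads and compare the average per-task wall time at each level.
public class ScalingProbe {
    // Stand-in CPU-bound workload; replace with your real query.
    static long task() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        return sum;
    }

    // Runs `threads` copies of task() concurrently; returns the mean
    // per-task time in milliseconds.
    static double avgMillis(int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long start = System.nanoTime();
                task();
                totalNanos.addAndGet(System.nanoTime() - start);
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return totalNanos.get() / 1e6 / threads;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d threads: %.2f ms/task%n", n, avgMillis(n));
        }
    }
}
```

If the per-task time stays roughly flat as thread count grows (up to the core count), the workload scales; if it climbs, the threads are contending for something.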

What impact, if any, does the presence of other threads have on the execution speed of each other thread?

Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features that can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.

Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.

I know it will depend upon a number of factors

Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.

but surely there must be some way of identifying whether or not your code will scale

Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:

  • disk I/O
  • network I/O
  • memory I/O
  • computation

My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?

If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.

If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.

EDIT given specifics:

I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses, one at a time.

Your problem sounds memory- and cache-bound. You should study the details of your target CPU/memory deployment or, if you are designing it, opt for high memory throughput.

  • A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
  • Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
  • Cache contention may be a factor in scaling as well.
  • Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
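
The "multiple copies of the data at initialization time" idea from the first bullet can be sketched in Java with a `ThreadLocal`. Here a plain `long[]` stands in for the graph, and all names are illustrative:

```java
import java.util.Arrays;

// Sketch: give each worker thread its own private copy of the read-only
// data, so threads never share the same memory pages. On a NUMA system,
// first-touch allocation will typically place each copy near the thread
// that created it.
public class PerThreadCopy {
    // Shared master copy, built once at initialization.
    static final long[] graph = new long[1_000_000];

    // Each thread lazily clones the master on first access.
    static final ThreadLocal<long[]> localGraph =
        ThreadLocal.withInitial(() -> Arrays.copyOf(graph, graph.length));

    static long traverse() {
        long[] g = localGraph.get(); // this thread's private copy
        long sum = 0;
        for (long v : g) sum += v;
        return sum;
    }
}
```

This trades memory footprint (one copy per thread) for the absence of remote-memory penalties; it only pays off when the data is genuinely read-only, as in the question.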

limited experience with computer hardware architecture and how multi-threading works under the hood.

Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Upvotes: 2

Curt

Reputation: 5722

These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.

Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
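
A rough upper bound for that ideal case is Amdahl's law: if a fraction $p$ of the work parallelizes perfectly and $n$ cores are available, the best achievable speedup is

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
```

For example, even if 95% of the work parallelizes ($p = 0.95$), four cores give at most $S(4) \approx 3.5$, not 4; the serial remainder always caps the scaling.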

Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.

The way that most modern operating systems, including Windows, Unix/Linux, and macOS, use the processor is to schedule processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads waiting for processor time. (This is a bit simplistic, as they all have some notion of process prioritization, so that high-criticality processes get pushed up the queue ahead of less important ones.)

When a thread is using a processor core, it gets all of it for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the thread uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core ends, and the next scheduled task begins. This tends to make fairly optimal use of the processor resources.

So what are the factors that determine how well a process will scale up?

  • What portion of its run time does a single process spend waiting for I/O and user input?

  • Do multiple threads hit the same resources, or different ones?

  • How much communication has to happen between threads? Between individual threads and your process's main thread? This takes synchronization, and introduces waiting.

  • How "tight" are the hotspots of the active thread? Can they fit into the processor's cache, or does the (much slower) memory bus have to be crossed to reach main memory?

As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
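
The synchronization cost mentioned in the bullets can be seen in miniature: the moment threads share a mutable counter, every increment funnels through one contended location, whereas independent per-thread counters combined at the end avoid that entirely. An illustrative Java sketch (names are made up for the example):

```java
import java.util.concurrent.atomic.AtomicLong;

// Two ways for THREADS threads to count to THREADS * PER_THREAD.
public class Contention {
    static final int THREADS = 4, PER_THREAD = 250_000;

    // Shared counter: correct, but every increment contends for the
    // same cache line, serializing the threads.
    static long sharedCounter() throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) counter.incrementAndGet();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return counter.get();
    }

    // Independent counters: each thread writes only its own slot, and
    // the totals are merged once after join() - no contention while
    // the threads run.
    static long independentCounters() throws InterruptedException {
        long[] counts = new long[THREADS];
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) counts[id]++;
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long total = 0;
        for (long c : counts) total += c;
        return total;
    }
}
```

Both produce the same answer; profiling them side by side shows how much of the run time the shared version spends synchronizing rather than computing.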

Upvotes: 4
