Usman Mutawakil

Reputation: 5259

How is processor speed distributed across threads?

Objective: I am trying to estimate how fast my code will execute when run concurrently in multiple threads.

Question 1)

If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?

Question 2)

What impact, if any, does the presence of other threads have on the execution speed of each other thread?

My Situation:

I traverse a graph in memory with a worst-case size of 1 million nodes. It's simply accessing 1 million memory addresses, one at a time. This takes half a second on one thread, and I was worried about how it will scale with multiple users performing the same query. Every user request is handled by a separate thread, so 100 simultaneous users will require 100 simultaneous threads. Each thread shares the same resource, but read-only; there is no writing. Is there any chance I could get each user to see roughly the same execution time?

Note: I know it will depend upon a number of factors, but surely there must be some way of identifying whether or not your code will scale if you know it takes x amount of time for a single thread on given hardware. As a final note, I'd like to add that I have limited experience with computer hardware architecture and how multi-threading works under the hood.

Upvotes: 1

Views: 324

Answers (2)

Brian Cain

Reputation: 14619

If I know exactly how fast my code runs for a single request in one thread, is there any way of estimating how fast it will run amongst multiple threads?

No, you should determine it empirically.
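
Measuring it empirically can be as simple as running the same task on an increasing number of threads and comparing the mean per-task time. A minimal sketch in Java (the `task` method is a stand-in for your real workload, such as the graph traversal from the question):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Empirical scaling probe: run the same read-only task on 1, 2, 4, ...
// threads and compare the average per-task wall time at each level.
public class ScalingProbe {
    // Stand-in CPU-bound workload; replace with your real query.
    static long task() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        return sum;
    }

    // Runs `threads` copies of task() concurrently; returns the mean
    // per-task time in milliseconds.
    static double avgMillis(int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                long start = System.nanoTime();
                task();
                totalNanos.addAndGet(System.nanoTime() - start);
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return totalNanos.get() / 1e6 / threads;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int n : new int[] {1, 2, 4, 8}) {
            System.out.printf("%d threads: %.2f ms/task%n", n, avgMillis(n));
        }
    }
}
```

If the per-task time stays roughly flat as thread count grows (up to the core count), the workload scales; if it climbs, the threads are contending for something.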

What impact, if any, does the presence of other threads have on the execution speed of each other thread?

Computation-bound tasks will likely scale very well and be mostly independent of other threads. Interestingly enough, some CPU manufacturers implement features that can increase the clock of a lone busy CPU core to compensate for all the idle cores. This sort of feature might confound your measurements and expectations about scaling.

Cache/Memory/disk-bound tasks will start to contend with each other except for where resource partitions exist.

I know it will depend upon a number of factors

Absolutely! So I recommend that you prototype it and measure it. And then find out why it didn't scale as well as you'd hoped and try a different algorithm. Iterate.

but surely there must be some way of identifying whether or not your code will scale

Yes, but unfortunately it requires a detailed description of the algorithm implemented by the code. Your results will be heavily dependent on the ratio of your code's activity among these general regions, and your target's capability for these:

  • disk I/O
  • network I/O
  • memory I/O
  • computation

My Situation: My application runs in an app server that assigns one thread for every user request. If my application executes in 2 seconds for 1 user, I can't assume it will always take 2 seconds if, say, 100 users are simultaneously running the same operation, correct?

If your app server computes pi to 100 digits for each user request, it will likely scale reasonably well until you encounter the core limit of your target.

If your app server does database queries for each user request, it will likely scale only as well as the target hardware can sustain the necessary load.

EDIT given specifics:

I traverse a graph in memory of worst case size 1 million nodes. It's simply accessing 1 million memory addresses, one at a time.

Your problem sounds memory- and cache-bound. You should study the details of your target CPU/memory deployment or, if you are designing it, opt for high memory throughput.

  • A NUMA system ("resource partitioning" for memory) can likely maximize your overall concurrent memory throughput. Note that since your problem seems to dictate concurrent access to the same memory pages, a NUMA system would penalize the process doing remote memory accesses. In this case, consider creating multiple copies of the data at initialization time.
  • Depending on the pattern of traversal, TLB pressure might be a factor. Consider experimenting with huge (aka "large") pages.
  • Cache contention may be a factor in scaling as well.
  • Your specific algorithm could easily end up dominating over any of the specific system effects, depending on how far apart the best and worst cases are.
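
The "multiple copies of the data at initialization time" idea from the first bullet can be sketched in Java with a `ThreadLocal`. Here a plain `long[]` stands in for the graph, and all names are illustrative:

```java
import java.util.Arrays;

// Sketch: give each worker thread its own private copy of the read-only
// data, so threads never share the same memory pages. On a NUMA system,
// first-touch allocation will typically place each copy near the thread
// that created it.
public class PerThreadCopy {
    // Shared master copy, built once at initialization.
    static final long[] graph = new long[1_000_000];

    // Each thread lazily clones the master on first access.
    static final ThreadLocal<long[]> localGraph =
        ThreadLocal.withInitial(() -> Arrays.copyOf(graph, graph.length));

    static long traverse() {
        long[] g = localGraph.get(); // this thread's private copy
        long sum = 0;
        for (long v : g) sum += v;
        return sum;
    }
}
```

This trades memory footprint (one copy per thread) for the absence of remote-memory penalties; it only pays off when the data is genuinely read-only, as in the question.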

limited experience with computer hardware architecture and how multi-threading works under the hood.

Profile the query using CPU performance counters with a tool like Intel's VTune, perf, or oprofile. It can tell you where expensive operations are executing in your code. With this information you can optimize your query to perform well (individually and in aggregate).

Upvotes: 2

Curt

Reputation: 5722

These are all interesting questions, but there is, unfortunately, no straightforward answer, because the answer will depend on a lot of different factors.

Most modern machines are multi-core: in an ideal situation, a four-thread process has the ability to scale up almost linearly in a four-core machine (i.e. run four times as fast).
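
A rough upper bound for that ideal case is Amdahl's law: if a fraction $p$ of the work parallelizes perfectly and $n$ cores are available, the best achievable speedup is

```latex
S(n) = \frac{1}{(1 - p) + \frac{p}{n}}
```

For example, even if 95% of the work parallelizes ($p = 0.95$), four cores give at most $S(4) \approx 3.5$, not 4; the serial remainder always caps the scaling.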

Most programs, though, spend most of their time waiting for things: disk or database access, the memory bus, network I/O, user input, and other resources. Faster machines don't generally make these things appreciably faster.

The way that most modern operating systems, including Windows, Unix/Linux, and macOS, use the processor is to schedule processor time to processes and threads in a more-or-less round-robin manner: at any given time there may be threads waiting for processor time. (This is a bit simplistic, as they all have some notion of process prioritization, so that high-criticality processes get pushed up the queue ahead of less important ones.)

When a thread is using a processor core, it gets all of it for as long as its time slice lasts: indeed, only one thing at a time is actually running on a single core. When the thread uses up its time slice, or requests some resource that isn't immediately available, its turn at the processor core ends, and the next scheduled task begins. This tends to make fairly optimal use of the processor resources.

So what are the factors that determine how well a process will scale up?

  • What portion of its run time does a single process spend waiting for I/O and user input?

  • Do multiple threads hit the same resources, or different ones?

  • How much communication has to happen between threads? Between individual threads and your process's main thread? This takes synchronization, and introduces waiting.

  • How "tight" are the hotspots of the active thread? Can they fit into the processor's cache, or does the (much slower) memory bus have to be crossed to reach main memory?

As a general rule, the more independent individual threads are of one another, the more linearly your application will scale. In real-world business applications, though, that is far from the case. The best way to increase the scaling ability of your process is to understand it--and its dependencies--well, and then use a profiler to find out where the most waiting occurs, and see if you can devise technical strategies to obviate them.
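
The synchronization cost mentioned in the bullets can be seen in miniature: the moment threads share a mutable counter, every increment funnels through one contended location, whereas independent per-thread counters combined at the end avoid that entirely. An illustrative Java sketch (names are made up for the example):

```java
import java.util.concurrent.atomic.AtomicLong;

// Two ways for THREADS threads to count to THREADS * PER_THREAD.
public class Contention {
    static final int THREADS = 4, PER_THREAD = 250_000;

    // Shared counter: correct, but every increment contends for the
    // same cache line, serializing the threads.
    static long sharedCounter() throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) counter.incrementAndGet();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return counter.get();
    }

    // Independent counters: each thread writes only its own slot, and
    // the totals are merged once after join() - no contention while
    // the threads run.
    static long independentCounters() throws InterruptedException {
        long[] counts = new long[THREADS];
        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                for (int j = 0; j < PER_THREAD; j++) counts[id]++;
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        long total = 0;
        for (long c : counts) total += c;
        return total;
    }
}
```

Both produce the same answer; profiling them side by side shows how much of the run time the shared version spends synchronizing rather than computing.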

Upvotes: 4
