Lockyli

Reputation: 19

Why is there such large variation in computation time for Fibonacci generation between adjacent values of n?

Each method is getting large variations in computation time, even with lots of repeats (to obtain a mean). The effect appears to get worse with larger n.

Results graph (25 repeats)

This is my timing code.

import time
import traceback
from multiprocessing import Pool, cpu_count

from tqdm import tqdm

# REPEATS (timing repeats per n) and MAXT (per-call time cutoff in seconds)
# are module-level constants.

def timeBlock(func, n, rep):
    # Mean wall-clock time of rep calls to func(n).
    start = time.perf_counter()
    for i in range(rep):
        func(n)
    end = time.perf_counter()
    return (end - start)/rep

def worker(args):
    # Runs in a pool process: time one (func, n) pair.
    func, n, rep = args
    try:
        elapsed = timeBlock(func, n, rep)
        return n, elapsed, None
    except Exception:
        return n, float("inf"), traceback.format_exc()

def trackFuncs(funcs):
    results = {name: [] for name in funcs.keys()}
    numProcesses = max(1, cpu_count() - 1)
    pool = Pool(processes=numProcesses)
    try:
        for name, func in funcs.items():
            n = 0
            progressBar = tqdm(desc=f"Tracking {name}", unit="", dynamic_ncols=True)
            while True:
                # One batch: n, n + 1, ..., n + numProcesses - 1 in parallel.
                tasks = [(func, n + i, REPEATS) for i in range(numProcesses)]
                try:
                    asyncResults = pool.imap_unordered(worker, tasks)
                    for taskN, elapsed, error in asyncResults:
                        if error:
                            print(f"Error for {name} at n={taskN}:\n{error}")
                            continue
                        results[name].append((taskN, elapsed))
                        if elapsed > MAXT:
                            progressBar.close()
                            print(f"Reached {MAXT}s for {name} at n={taskN}")
                            raise StopIteration
                except StopIteration:
                    break
                n += numProcesses
                progressBar.update(numProcesses)
            progressBar.close()
    except KeyboardInterrupt:
        print("\nProcess interrupted by user")
    finally:
        pool.terminate()  # stop workers (covers both KeyboardInterrupt and normal exit)
        pool.join()
    return results

funcs, the argument to trackFuncs, is a dictionary of Fibonacci functions keyed by each function's name.
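Roughly like this - the implementation below is only an illustration, not one of the actual functions being benchmarked:

def fibIterative(n):
    # Illustrative placeholder implementation.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

funcs = {"fibIterative": fibIterative}
results = trackFuncs(funcs)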

The functions in the graph's legend should be ordered approximately by speed, but the results don't always reflect that because large spikes push some runs over the computation-time threshold.

I've tried repeating and averaging the time for each n, but the effect also appears to worsen with more repeats.

Any ideas why this happens and how, if possible, it can be fixed?

Upvotes: 1

Views: 78

Answers (2)

hobbs

Reputation: 240512

Judging from the timing harness code I see, you're running all of the different algorithms in parallel and timing them independently? Then you'll get a huge amount of variation on any modern system because cores share resources.

For starters, if you have SMT (hyperthreading), then you have twice as many logical cores (cpu_count()) as actual CPU cores, with the two threads on each core sharing time. Ideally, this produces a nice boost in efficiency and overall system throughput, because a single thread never uses 100% of a core's resources. By letting one thread use some execution units while another thread uses the rest, or letting one thread have free rein while the other is waiting for a memory request to be filled, the CPU can get more work done than if there were only one thread per core. But it makes timings much less predictable, because the CPU resources available to one of your algorithms depend on which other algorithm is running on its SMT sibling (which could even be nothing at all, depending on scheduling).
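A quick way to check whether SMT is inflating cpu_count() is to compare logical and physical counts. The sketch below assumes the third-party psutil package is installed, since the standard library only reports logical CPUs:

import os
import psutil  # third-party package, assumed to be available

logical = os.cpu_count()                    # logical CPUs, i.e. SMT threads
physical = psutil.cpu_count(logical=False)  # physical cores (may be None)
print(f"{logical} logical CPUs on {physical} physical cores")
if physical and logical > physical:
    print("SMT is enabled: paired workers may share one core's execution units")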

Cores are also sharing cache and memory bandwidth in a complex way that depends on the topology of the individual chip ("core clusters" and such), so performance depends on what's running on the whole system at that moment, but especially on what's running on "nearby cores", whatever "nearby" means for your CPU. And who gets placed near to what is up to the OS, unless you go out of your way to set up "core affinity".

Another shared resource is power, and its cousin, the ability to remove heat. Modern CPUs are constantly adjusting the speed of their cores depending on the amount of work that each one is doing, to keep the total power usage under some limit. One or two cores may be able to run very fast while the others are idle, while if all cores are active the "speed limit" is generally quite a bit lower. Different code uses different amounts of power, and again there are limits both locally and globally, so the speed of the CPU you're running on depends on what else is running at the moment, both overall and "nearby". And not only what's running right now, but what's been running recently, because heat takes time to dissipate through silicon, so chips will allow using a lot of power for a brief moment, and then force everything to slow down for a while after to prevent burning up. All of this is almost entirely out of your control, so the speed of the core you're running on is effectively a random variable.

And, depending on the CPU you're running on, not all of the cores are necessarily even the same. A CPU die might contain completely different designs of cores with vastly different performance profiles, under names like "performance cores" and "efficiency cores", with the OS choosing among them based on some guesswork as to the amount of horsepower a given process really needs. But if you have that, and your benchmarking workload is trying to keep all cores active at once, then it's effectively random which kind of core one of your subtests will run on.

All in all, benchmarking on modern systems is a very complex thing, and mostly requires that you do the opposite of what you've done here: run a single thread, pinned to a single core, on a system that's otherwise as quiescent as possible. Then do repeated runs starting from idle, and bin the data based on how long it's been since you were idle, to get an idea of how much impact "turbo" has on you (or, simply run for long enough that the turbo gets amortized out and you can see the steady state).
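On Linux, for example, pinning can be done from Python itself with os.sched_setaffinity. This is a rough sketch (Linux-only, and the core number is arbitrary), not a drop-in replacement for the harness above:

import os
import time

def benchPinned(func, n, rep, core=0):
    # Restrict this process to a single core so the scheduler can't migrate
    # it between cores in different frequency/thermal states.
    os.sched_setaffinity(0, {core})  # Linux-only API
    start = time.perf_counter()
    for _ in range(rep):
        func(n)
    return (time.perf_counter() - start) / rep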

Or, if you're interested in all-core performance rather than single-core, then load up all cores with the same exact code, collect your data from that and average it, then move on to the next algorithm.
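A rough sketch of that idea - every worker times the identical workload and the per-worker means are averaged; the names and structure here are illustrative:

import time
from multiprocessing import Pool, cpu_count

def timeSame(args):
    # Each worker runs exactly the same (func, n, rep) task.
    func, n, rep = args
    start = time.perf_counter()
    for _ in range(rep):
        func(n)
    return (time.perf_counter() - start) / rep

def allCoreMean(func, n, rep):
    # Load every core with the same code, then average the results.
    workers = cpu_count()
    with Pool(workers) as pool:
        times = pool.map(timeSame, [(func, n, rep)] * workers)
    return sum(times) / len(times)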

Upvotes: 0

Pete Kirkham

Reputation: 49331

I would hazard a guess that you have two main sources of noise:

  • the process start will add noise for small N
  • for larger N, once the integer can't be represented as a primitive, it will be represented as an object and every now and then you will hit a garbage collection spike.

What you can do about it rather depends on what the purpose of the exercise is - you are, after all, running the algorithms in a garbage-collected environment, so GC will affect the time if you use them for real. But if you want to test 'just the algorithm' and hope you don't run out of memory, then you can turn GC off (gc.disable()) for the test, then request a collection after recording the time.
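A minimal sketch of that, written as a variant of the question's timeBlock (the name is just for illustration):

import gc
import time

def timeBlockNoGC(func, n, rep):
    # Keep the cyclic collector out of the timed region, then restore it
    # and collect once after the measurement has been taken.
    gc.disable()
    try:
        start = time.perf_counter()
        for _ in range(rep):
            func(n)
        elapsed = (time.perf_counter() - start) / rep
    finally:
        gc.enable()
    gc.collect()
    return elapsed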

Upvotes: 0
