Reputation: 763
How to find the optimal number of workers for parfor on Amazon's virtual machine?
In which cases should I use the number of physical cores, and in which the number of logical cores?
Is there any "Rule of Thumb" for this?
I run compiled code ( an executable ).
Upvotes: 1
Views: 248
Reputation: 1
How to find the optimal number of workers for parfor on Amazon's virtual machine?
In similarly, yet only partially, defined situations I start with the same set of principal objections as above: "Optimal according to what criterion function ( utility fn / penalty fn ), over what parameters and costs - [TIME] ( first partial result's latency, E2E completion ), [SPACE] ( yes: cache & RAM footprints ), scaling, externalities ( energy costs, { owned | rented }-infrastructure costs, R&D costs, design/engineering costs, QA costs, validation/certification costs, other related labour costs, costs of risk-mitigation policies - to name just a few ) ?"
This tells us what happens if no such criteria are pre-defined: "( having no target ), any road will get you there..."
This question was solved by Dr. Gene Amdahl ( 1967 ), building on the work of Prof. Kenneth E. Knight ( 1966 ), and the solution lies in applying the law of diminishing returns, named after the former as Amdahl's Law. For both the details and the contemporary criticism of naively applying the original ( overhead-naive, atomic-amounts-of-work ) formulation, read this.
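For orientation only, here is a minimal MATLAB sketch of an overhead-aware speedup model in the spirit of that criticism. All figures are illustrative placeholders, not measured data, to be replaced by the values recorded in the Steps below:

```matlab
% Overhead-aware diminishing-returns model: serial part + parallelisable
% part / N, plus the parfor setup/termination overheads ( Steps 0-1 ).
T_serial   = 10.0;   % [s] net time of the pure-serial sections     ( Step 0 )
T_parallel = 90.0;   % [s] net time of the parallelisable section   ( Step 0 )
T_overhead =  6.0;   % [s] parfor / SER-DES add-on overhead costs   ( Step 1 )

N = 1:64;                                           % candidate worker counts
T_N     = T_serial + T_overhead + T_parallel ./ N;  % overhead-aware wall-clock
speedup = ( T_serial + T_parallel ) ./ T_N;         % vs. the pure-serial run

plot( N, speedup ); xlabel( 'number of workers N' ); ylabel( 'speedup' );
```

Note that no matter how many workers are added, the speedup saturates well below N, which is exactly the diminishing-returns effect the Steps below quantify for your own CuR.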
Step 0 :
test/record the net-time of all pure-[SERIAL]-ly executed sections of the Code-under-Review ( CuR ), that run "before" and "after" the wished-to-have parfor-syntax-constructor.
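A minimal sketch, assuming the CuR can be driven from MATLAB as three sections ( the function names below are hypothetical placeholders, not parts of the original code ):

```matlab
% All three function names are hypothetical placeholders for the respective
% sections of the Code-under-Review:
tic; prepare_inputs();           t_serial_pre     = toc;  % "before" the parfor section
tic; run_core_loop_serially();   t_parallelisable = toc;  % the section to be parfor-ed
tic; collect_and_postprocess();  t_serial_post    = toc;  % "after"  the parfor section
```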
Step 1 :
test/record the net-time of all parfor-{ instantiation + termination }-overhead costs. Here take all properly-scaled parameters for call-signature and returned-values types & sizings ( the CuR has to spend some time to serialise/deserialise all of them on a per-call basis, and has to spend additional SER/DES overhead time on preparing/transporting/collecting each of the "remotely"-parfor-ed results ), plus the scale of MEM-allocs, which also takes remarkable amounts of time compared to the CuR itself whenever a "shallow"-computing-density or memory-area-re-use-inefficient parallel-CuR gets into consideration. Some of these add-on overhead costs ( recorded in the [TIME]-domain ) get accrued "inside" the parfor-decorated parts of the code ( and are invisible, or do not happen at all, during a pure-[SERIAL] code-execution, so the test/benchmark may require some work to isolate these add-on costs "inside" the parfor-ed section's suspect ops, which allocate and never re-use larger memory-areas etc., if the cost-modelling strives to count down to the cents of externally paid infrastructure expenses ).
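A minimal sketch of isolating these add-on overheads; ticBytes / tocBytes ship with Parallel Computing Toolbox R2016b+ and report the SER/DES traffic to/from the workers, while the payload sizing below is an assumption to be replaced by your real call-signature scales:

```matlab
delete( gcp( 'nocreate' ) );                        % make sure no pool is running
tic; pool = parpool( 4 ); t_pool_up = toc;          % instantiation overhead

data = rand( 1e6, 8 );                              % assumed properly-scaled payload
out  = zeros( size( data ) );
tic; ticBytes( pool );
parfor i = 1:size( data, 2 )
    out(:,i) = data(:,i) .^ 2;                      % deliberately "shallow" work-unit
end
bytesMoved       = tocBytes( pool );                % SER/DES traffic per worker
t_parfor_shallow = toc;                             % ~ dominated by the overheads

tic; delete( pool ); t_pool_down = toc;             % termination overhead
```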
Step 2 :
test/record the atomicity-of-workunit-introduced "orphaned"-time of the last batch of work-unit(s), which will never get faster ( due to the atomicity of a work-unit - an indivisible duration of work that no other free processor-core can help with ).
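A worked micro-example of this "orphaned"-time effect, with illustrative numbers only:

```matlab
% 10 indivisible work-units of 30 [s] each on 4 workers:
N_units = 10;     % indivisible work-units
t_atom  = 30;     % [s] duration of one atomic work-unit
W       = 4;      % workers
T_batched = ceil( N_units / W ) * t_atom;   % = 90 [s] : the last batch keeps only
                                            %   2 workers busy, 2 sit idle
T_ideal   = N_units * t_atom / W;           % = 75 [s] : the never-reachable ideal
```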
...should use the number of the physical and for which the number of the logical cores?
Step 3 :
test/record the actual work-stealing / net-own-work ratio for each of the provided ( whatever their "marketing" labels ) types of code-execution units ( virtual devices bear most of the inefficiencies, due to higher latencies and higher work-stealing ratios - i.e. expect way poorer efficiencies for any form of primarily computing-intensive parallel workers ).
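A minimal sketch of comparing pool sizes tied to the physical and to the logical core counts; feature('numcores') is an undocumented but commonly used call reporting physical cores, and the 2x logical figure below is an assumption - use the instance's real vCPU count instead:

```matlab
nPhysical = feature( 'numcores' );
nLogical  = 2 * nPhysical;                   % assumption: 2 HW-threads per core

c = parcluster( 'local' );
c.NumWorkers = nLogical;                     % permit the oversubscribed pool

for nW = [ nPhysical, nLogical ]
    delete( gcp( 'nocreate' ) );             % start each test from a clean pool
    pool = parpool( c, nW );
    s    = zeros( 1, nW * 4 );
    tic;
    parfor i = 1:numel( s )
        s(i) = sum( svd( rand( 500 ) ) );    % computing-intensive work-unit
    end
    fprintf( 'workers = %2d : %8.3f [s]\n', nW, toc );
end
```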
Step 4 :
test/record the actual frequency / cache-sizing / RAM-caused CPU-starvation conditions, which will cause the "units-of-work" to actually execute "there" more poorly than one would have expected on a { local- | private-grid- }computing infrastructure.
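A minimal probe sketch ( the array sizes and repetition counts are assumptions ): run it once on your own hardware and once on the rented instance, and compare the ratios - a markedly worse RAM-side figure signals cache / memory-channel starvation:

```matlab
small = rand( 1, 2^16 );     % ~0.5 MB of doubles : ought to stay cache-resident
large = rand( 1, 2^27 );     % ~1   GB of doubles : forces RAM-I/O traffic

reps_small = 2000; reps_large = 20;
tic; for k = 1:reps_small, s1 = sum( small ); end; t_cache = toc;
tic; for k = 1:reps_large, s2 = sum( large ); end; t_ram   = toc;

perElem_cache = t_cache / ( reps_small * numel( small ) );
perElem_ram   = t_ram   / ( reps_large * numel( large ) );
fprintf( 'per-element cost ratio RAM/cache ~ %.1f\n', perElem_ram / perElem_cache );
```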
Step 5 :
compare the all-accrued costs for any number of cores/types under consideration, using the due ( and properly relaxed by all the in-efficiencies recorded above ) parameters from Steps 0:4; plugging in the price-plan(s) from any of your infrastructure providers, you will get the approximate costs of using more/fewer resources under any given time/financial-budget constraint.
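A minimal sketch of that final comparison. Every figure below is a placeholder to be replaced with the values recorded in Steps 0:4 and with the provider's actual price-plan, and the joint time*cost criterion is just one illustrative choice - substitute your own utility / penalty function:

```matlab
T_serial = 10;  T_par = 90;  T_ovh = 6;            % [s]   from Steps 0-1
eff      = 0.70;                                   % < 1.0 in-efficiency factor
                                                   %       from Steps 2-4
price_per_core_hour = 0.05;                        % [USD] assumed price-plan

N      = 1:64;
T_N    = T_serial + T_ovh + ( T_par ./ ( eff * N ) );   % [s]   wall-clock per run
cost_N = ( T_N / 3600 ) .* N * price_per_core_hour;     % [USD] expense per run

[ ~, iBest ] = min( T_N .* cost_N );               % one possible joint criterion
fprintf( 'candidate optimum: N = %d workers\n', N( iBest ) );
```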
All credits go to Dr. Gene Amdahl, whose work is hated by all the marketeers selling as many as possible "cheap"-labelled, yet underperforming-and-shared-only, toys ( yes, virtualised also means another layer of add-on overheads, and results in SHARED co-execution on the silicon: 0.5 [ns] cache depletions, paid back by many repeated ~350+ [ns] re-fetches of the previously already-cached data again "across" the NUMA locality-boundary, resulting in repetitive RAM-I/O-channel bottlenecks, are the least of the losses to mention here - the sharing/virtualisation-introduced work-stealing of CPU-ticks will show you the amounts of stolen CPU-ticks your computing-payloads did not get to compute, in favour of the others "sharing" this very same "cloudy-heaven dream", paid by you on a "per-hour-of-use" basis, outside of your domain of control ).
Clouds may be fine for seldom-run, ad-hoc, low computing-intensity workloads ( which DSP, obviously, is not, is it? ) with minimum cross-domain communication/coordination - a cohort of reasonably-sized tasks that may benefit from distributed, latency-masking, (naive)-brute-force execution of many such "shallow" & "not-much-demanding" work-units ( many times inefficient per se from the point of view of HPC-grade computing ). Yet the very "cheap"-(just)-sounding price-plans make one realise that one's own, private infrastructure would have cost ( if decided in due time, before these expenses were already spent ) about the same as using the "cheaply" rented one for the second, third, ... n-th time.
Upvotes: 4