Reputation: 737
I tried to compare two executions of the same Dask code on my two computers.
The first computer has an Intel i5 6500 (4 threads) / 16 GB RAM, and the other is a MacBook Pro with an Intel i7 8750H (12 threads) / 16 GB RAM.
Here is the function:

%%timeit
my_df.map_partitions(
    lambda df: df.apply(
        lambda x: detect(str(x['Description'])), axis=1)).\
    compute(scheduler='processes')
I was expecting the execution to be approximately 3 times faster, but I got these results for 10,000 rows:

22.9 seconds for the i5 6500
21.6 seconds for the i7 8750H

Same partition size and same parameters.
How is this possible?
Upvotes: 2
Views: 72
Reputation: 28683
A few notes to help you understand the situation. I note that you do not specify how the data is loaded or what detect is.

compute materialises the whole result, which means copying the partitions back to the main client thread and concatenating them. There is no parallelism in this. With the distributed scheduler, you can time just the processing part:
import distributed

client = distributed.Client(...)
p_df = my_df.persist()
distributed.wait(p_df)

# time this bit
out = p_df.map_partitions(
    lambda df: df.apply(
        lambda x: detect(str(x['Description'])), axis=1)).persist()
distributed.wait(out)
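The same effect can be seen without Dask at all: with process-based parallelism, every row and every result must be pickled between processes, so for a small workload that fixed overhead can swamp any gain from extra cores. A minimal stdlib sketch (using a hypothetical detect_stub, since the original detect function is not shown):

```python
# Sketch of why 4 vs 12 threads may show no difference on 10,000 rows:
# process startup plus pickling of inputs/outputs is a fixed cost that
# dominates when the per-row work is cheap.
from concurrent.futures import ProcessPoolExecutor
import time


def detect_stub(text):
    # Hypothetical stand-in for the unspecified `detect` function.
    return text[:2]


def run_serial(rows):
    # Plain list comprehension: no serialization, no extra processes.
    return [detect_stub(r) for r in rows]


def run_processes(rows, workers):
    # Each chunk of rows is pickled to a worker process, and each chunk
    # of results is pickled back -- the copying the answer describes.
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(detect_stub, rows, chunksize=1000))


if __name__ == "__main__":
    rows = ["Description %d" % i for i in range(10_000)]

    t0 = time.perf_counter()
    serial = run_serial(rows)
    t1 = time.perf_counter()

    t2 = time.perf_counter()
    parallel = run_processes(rows, workers=4)
    t3 = time.perf_counter()

    assert serial == parallel
    print("serial: %.4fs, 4 processes: %.4fs" % (t1 - t0, t3 - t2))
```

On inputs this small, the serial run often wins outright, which is consistent with the two machines finishing in nearly the same time regardless of core count.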
Upvotes: 3