Reputation: 41601
Since Julia is really fast, I was wondering what the fastest way is to join data frames. For example, in R we may use the data.table
package which is pretty fast. Sometimes when working with big datasets the computation time becomes high. Here I created a benchmark using innerjoin
and leftjoin
:
julia> using StatsBase, DataFrames, BenchmarkTools
julia> n = 1000000
julia> df1 = DataFrame(x = 1:n,
y1 = rand(n))
julia> df2 = DataFrame(x = 1:n,
y2 = rand(n))
julia> @benchmark innerjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 102 samples with 1 evaluation.
Range (min … max): 41.437 ms … 73.495 ms ┊ GC (min … max): 0.00% … 29.87%
Time (median): 45.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 49.160 ms ± 8.227 ms ┊ GC (mean ± σ): 7.26% ± 11.50%
▄▅ █ ▂ ▂
██▆██▅████▅▃▃▅█▆▁▅▁▁▁▁▃▃▁▁▁▁▆▁▃▅▁▁▁▃▁▅▅▅▅▃▁▁▃▁▃▁▃▁▁▁▃▃▃▁▁▁▃ ▃
41.4 ms Histogram: frequency by time 71.7 ms <
Memory estimate: 38.16 MiB, allocs estimate: 174.
julia> @btime innerjoin(df1, df2, on = :x)
41.592 ms (174 allocations: 38.16 MiB)
julia> @benchmark leftjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 96 samples with 1 evaluation.
Range (min … max): 43.823 ms … 79.582 ms ┊ GC (min … max): 0.00% … 34.30%
Time (median): 48.566 ms ┊ GC (median): 0.00%
Time (mean ± σ): 52.387 ms ± 9.026 ms ┊ GC (mean ± σ): 6.74% ± 10.90%
█▂▁▅▄▅
██████▅▃▆▆▃▃▅▅▆▆▃▁█▃▃▁▁▃▁▃▃▁▁▁▁▃▃▁▅▃▁▁█▃▃▃▃▅▃▁▁▅▁▁▁▁▁▁▃▁▁▁▃ ▁
43.8 ms Histogram: frequency by time 76.9 ms <
Memory estimate: 39.23 MiB, allocs estimate: 230.
julia> @btime leftjoin(df1, df2, on = :x)
44.198 ms (230 allocations: 39.23 MiB)
Here we can see that innerjoin is in this case slightly faster. So, I was wondering if there are faster ways of joining data frames in Julia?
Upvotes: 2
Views: 350
Reputation: 6395
If you know that the values in :x
column in both data frames have the same sequence of values (which is happening in your case), then you can use hcat
. I get a better result on hcat
(~24,500x faster 👀 on average without copying the data and ~7x faster with copying) (note that you should prepend $
to df1
and df2
):
julia> @benchmark hcat($df1, $df2[!, 2:end], copycols=false)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.550 μs … 1.705 ms ┊ GC (min … max): 0.00% … 99.45%
Time (median): 2.440 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.670 μs ± 17.047 μs ┊ GC (mean ± σ): 6.35% ± 0.99%
▂█▇▆▁ ▁▂▂▃▄▁
█████▅▄▃▄▆████████▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
1.55 μs Histogram: frequency by time 5.72 μs <
Memory estimate: 2.66 KiB, allocs estimate: 34.
# with copying
julia> @benchmark hcat($df1, $df2[!, 2:end])
BenchmarkTools.Trial: 633 samples with 1 evaluation.
Range (min … max): 4.387 ms … 50.850 ms ┊ GC (min … max): 0.00% … 86.24%
Time (median): 6.380 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.876 ms ± 6.799 ms ┊ GC (mean ± σ): 17.48% ± 16.80%
▄▅██▅▃▂
███████▆▄▅▅▄▄▆▅▁▁▁▁▅▄▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄▄▄▅▆▄▅▅▄▅ ▇
4.39 ms Histogram: log(frequency) by time 42.1 ms <
Memory estimate: 22.89 MiB, allocs estimate: 64.
julia> hcat(df1, df2[!, 2:end]) == hcat(df1, df2[!, 2:end], copycols=false) == innerjoin(df1, df2, on = :x) == leftjoin(df1, df2, on = :x)
true
# Element-wise comparison (However, the above expression is enough)
julia> all(Matrix(hcat(df1, df2[!, 2:end]) .== hcat(df1, df2[!, 2:end], copycols=false) .== innerjoin(df1, df2, on = :x) .== leftjoin(df1, df2, on = :x)))
true
If I want to make a summary:
Time | Memory | |
---|---|---|
hcat | ~24,500x faster | ~19,000x fewer |
hcat(with copy) | ~7x faster | ~1.7x fewer |
*Note that the comparisons are against your best result, which is related to innerjoin(df1, df2, on = :x)
Also, note that @benchmark
contains comprehensive results, and it contains @btime
as well:
So you don't need to necessarily run @btime
to get it!
Upvotes: 2