Quinten
Quinten

Reputation: 41601

What is the fastest way to join DataFrames in Julia?

Since Julia is really fast, I was wondering what the fastest way is to join data frames. For example, in R we may use the data.table package which is pretty fast. Sometimes when working with big datasets the computation time becomes high. Here I created a benchmark using innerjoin and leftjoin:

julia> using StatsBase, DataFrames, BenchmarkTools
julia> n = 1000000
julia> df1 = DataFrame(x = 1:n,
                       y1 = rand(n))
julia> df2 = DataFrame(x = 1:n,
                       y2 = rand(n))

julia> @benchmark innerjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 102 samples with 1 evaluation.
 Range (min … max):  41.437 ms … 73.495 ms  ┊ GC (min … max): 0.00% … 29.87%
 Time  (median):     45.926 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.160 ms ±  8.227 ms  ┊ GC (mean ± σ):  7.26% ± 11.50%

  ▄▅ █   ▂      ▂                                              
  ██▆██▅████▅▃▃▅█▆▁▅▁▁▁▁▃▃▁▁▁▁▆▁▃▅▁▁▁▃▁▅▅▅▅▃▁▁▃▁▃▁▃▁▁▁▃▃▃▁▁▁▃ ▃
  41.4 ms         Histogram: frequency by time        71.7 ms <

 Memory estimate: 38.16 MiB, allocs estimate: 174.

julia> @btime innerjoin(df1, df2, on = :x)
  41.592 ms (174 allocations: 38.16 MiB)

julia> @benchmark leftjoin(df1, df2, on = :x)
BenchmarkTools.Trial: 96 samples with 1 evaluation.
 Range (min … max):  43.823 ms … 79.582 ms  ┊ GC (min … max): 0.00% … 34.30%
 Time  (median):     48.566 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.387 ms ±  9.026 ms  ┊ GC (mean ± σ):  6.74% ± 10.90%

  █▂▁▅▄▅                                                       
  ██████▅▃▆▆▃▃▅▅▆▆▃▁█▃▃▁▁▃▁▃▃▁▁▁▁▃▃▁▅▃▁▁█▃▃▃▃▅▃▁▁▅▁▁▁▁▁▁▃▁▁▁▃ ▁
  43.8 ms         Histogram: frequency by time        76.9 ms <

 Memory estimate: 39.23 MiB, allocs estimate: 230.

julia> @btime leftjoin(df1, df2, on = :x)
  44.198 ms (230 allocations: 39.23 MiB)

Here we can see that innerjoin is in this case slightly faster. So, I was wondering if there are faster ways of joining data frames in Julia?

Upvotes: 2

Views: 350

Answers (1)

Shayan
Shayan

Reputation: 6395

If you know that the values in :x column in both data frames have the same sequence of values (which is happening in your case), then you can use hcat. I get a better result on hcat (~24,500x faster 👀 on average without copying the data and ~7x faster with copying) (note that you should prepend $ to df1 and df2):

julia> @benchmark hcat($df1, $df2[!, 2:end], copycols=false)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.550 μs …  1.705 ms  ┊ GC (min … max): 0.00% … 99.45%
 Time  (median):     2.440 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.670 μs ± 17.047 μs  ┊ GC (mean ± σ):  6.35% ±  0.99%

  ▂█▇▆▁      ▁▂▂▃▄▁
  █████▅▄▃▄▆████████▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  1.55 μs        Histogram: frequency by time        5.72 μs <

 Memory estimate: 2.66 KiB, allocs estimate: 34.

# with copying
julia> @benchmark hcat($df1, $df2[!, 2:end])
BenchmarkTools.Trial: 633 samples with 1 evaluation.
 Range (min … max):  4.387 ms … 50.850 ms  ┊ GC (min … max):  0.00% … 86.24%
 Time  (median):     6.380 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   7.876 ms ±  6.799 ms  ┊ GC (mean ± σ):  17.48% ± 16.80%

  ▄▅██▅▃▂
  ███████▆▄▅▅▄▄▆▅▁▁▁▁▅▄▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄▄▄▅▆▄▅▅▄▅ ▇
  4.39 ms      Histogram: log(frequency) by time     42.1 ms <

 Memory estimate: 22.89 MiB, allocs estimate: 64.

julia> hcat(df1, df2[!, 2:end]) == hcat(df1, df2[!, 2:end], copycols=false) == innerjoin(df1, df2, on = :x) == leftjoin(df1, df2, on = :x)
true

# Element-wise comparison (However, the above expression is enough)
julia> all(Matrix(hcat(df1, df2[!, 2:end]) .== hcat(df1, df2[!, 2:end], copycols=false) .== innerjoin(df1, df2, on = :x) .== leftjoin(df1, df2, on = :x)))
true

If I want to make a summary:

Time Memory
hcat ~24,500x faster ~19,000x fewer
hcat(with copy) ~7x faster ~1.7x fewer

*Note that the comparisons are against your best result, which is related to innerjoin(df1, df2, on = :x)


Additional Note

Also, note that @benchmark contains comprehensive results, and it contains @btime as well: enter image description here

So you don't need to necessarily run @btime to get it!

Upvotes: 2

Related Questions