crayxt
crayxt

Reputation: 2405

Joining list of dataframes in Julia

I am trying to speed up the loop, in which consecutive dataframes are joined with the first one with the first column as key. Dataframes are produced by a function my_function. First column is named :REF. Consecutive Dataframes could be shorter that first one, thus I cannot directly assign to DF column, as I would do in pandas.

base_df = my_function(elem1)

for elem in elems[2:end]
    tmp = my_function(elem)
    base_df = join(base_df, tmp, on=:REF, kind=:left)
end

Is there any way to join list of dataframes into one? Thanks,

PS: DataFrames are of different types: String, Int, Float64.

Upd. So, example DataFrames:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))

What I looking for it to combine those three (or more) into single DataFrame at once. Note the row count differencies.

Upd2. Sorry, it should have been diffent columns on df1, df2 and df3 (D1, D2 and D3). Here is the correct setup of DFs

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))

Upvotes: 2

Views: 1533

Answers (2)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69949

Here is an alternative approach that assumes you want a left join (as in your question - if you need another type of join it should be simple to adjust it). The difference from Dan Getz solution is that it does not use DataVector but operates on arrays allowing missing (you can check the difference by running showcols on resulting DataFrame; the benefit is that it will be more efficient to work with such data later as we will know their types):

function joiner(ref_left, ref_right, val_right)
    x = DataFrames.similar_missing(val_right, length(ref_left))
    j = 1
    for i in 1:length(ref_left)
        while ref_left[i] > ref_right[j]
            j += 1
            j > length(ref_right) && return x
        end
        if ref_left[i] == ref_right[j]
            x[i] = val_right[j]
        end
    end
    return x
end

function left_join_sorted(elems::Vector{DataFrame}, on::Symbol)
    # we perform left join to base_df
    # the columns of elems[1] will be reused, use deepcopy if you want fresh columns
    base_df = copy(elems[1])
    ref_left = base_df[:REF]
    for i in 2:length(elems)
        df = elems[i]
        ref_right = df[:REF]
        for n in names(df)
            if n != on
                # this assumes that column names in all data frames except on are unique, otherwise they will be overwritten
                # we perform left join to the first DataFrame in elems
                base_df[n] = joiner(ref_left, ref_right, df[n])
            end
        end
    end
    base_df
end

Here is an example of usage:

julia> left_join_sorted([df1, df2, df3], :REF)
5×4 DataFrames.DataFrame
│ Row │ REF │ D1       │ D2        │ D3       │
├─────┼─────┼──────────┼───────────┼──────────┤
│ 1   │ 1   │ 0.133361 │ 0.179822  │ 0.200842 │
│ 2   │ 2   │ 0.548581 │ 0.836018  │ 0.906814 │
│ 3   │ 3   │ 0.304062 │ 0.0797432 │ 0.946639 │
│ 4   │ 4   │ 0.755515 │ missing   │ 0.519437 │
│ 5   │ 5   │ 0.571302 │ missing   │ missing  │

As a side benefit my benchmarks show that this is ~20x faster than using DataVector (if you want a further speedup use @inbounds but probably the benefits are not worth the risks).

EDIT: fixed condition in joiner loop.

Upvotes: 2

Dan Getz
Dan Getz

Reputation: 18227

Just to setup the answer, suppose:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))

elems = [df1, df2, df3]
my_function = identity

Now the code to generate the big DataFrame:

dfs = my_function.(elems)
base_df = DataFrame(Dict([f=>vcat(getindex.(dfs,f)...) for f in names(dfs[1])]...))

Giving something like:

12×2 DataFrames.DataFrame
│ Row │ D1         │ REF │
├─────┼────────────┼─────┤
│ 1   │ 0.664144   │ 1   │
│ 2   │ 0.119155   │ 2   │
│ 3   │ 0.471053   │ 3   │
│ 4   │ 0.547811   │ 4   │
│ 5   │ 0.600263   │ 5   │
│ 6   │ 0.21306    │ 1   │
│ 7   │ 0.985412   │ 2   │
│ 8   │ 0.886738   │ 3   │
│ 9   │ 0.00926173 │ 1   │
│ 10  │ 0.701962   │ 2   │
│ 11  │ 0.328322   │ 3   │
│ 12  │ 0.753062   │ 4   │

This approach reduces the memory used from quadratic-ish to linear-ish (and performance improves inline with memory reduction)

UPDATE

As new details came to light (and my understanding of the question improved), here is code to better generate the desired base_df:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))
elems = [df1, df2, df3]

cols = [(i,f) for (i,t) in enumerate(elems) for f in names(t) if !(f == :REF)]
rows = union(getindex.(elems,:REF)...)
ref2row = Dict(v=>i for (i,v) in enumerate(rows))

pre_df = Dict{Symbol,DataVector{Any}}([c[2]=>DataArray(eltype(elems[c[1]][c[2]]),
 length(rows)) for c in cols])

foreach(tpl -> pre_df[tpl[3][1]][ref2row[tpl[2]]] = tpl[3][2],
 [(i,r[:REF],v) 
  for (i,t) in enumerate(elems) 
  for r in eachrow(t) 
  for v in r if v[1] != :REF
 ])

pre_df[:REF] = [ref2row[i] for i=1:length(rows)]

base_df = DataFrame(pre_df)

Giving:

5×4 DataFrames.DataFrame
│ Row │ D1       │ D2       │ D3        │ REF │
├─────┼──────────┼──────────┼───────────┼─────┤
│ 1   │ 0.93479  │ 0.582954 │ 0.133983  │ 1   │
│ 2   │ 0.472456 │ 0.992173 │ 0.32442   │ 2   │
│ 3   │ 0.365478 │ 0.117772 │ 0.62522   │ 3   │
│ 4   │ 0.976192 │ NA       │ 0.0861988 │ 4   │
│ 5   │ 0.76358  │ NA       │ NA        │ 5   │

Upvotes: 2

Related Questions