Joining list of dataframes in Julia

Question

I am trying to speed up a loop in which consecutive dataframes are joined to the first one, using the first column as the key. The dataframes are produced by a function my_function, and the first column is named :REF. Consecutive dataframes can be shorter than the first one, so I cannot assign directly to a DataFrame column, as I would in pandas.

base_df = my_function(elems[1])

for elem in elems[2:end]
    tmp = my_function(elem)
    base_df = join(base_df, tmp, on=:REF, kind=:left)
end

Is there any way to join a list of dataframes into one? Thanks.

PS: The DataFrame columns are of different types: String, Int, Float64.

Upd. So, example DataFrames:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))

What I am looking for is to combine those three (or more) into a single DataFrame at once. Note the differences in row count.

Upd2. Sorry, df1, df2 and df3 should have had different columns (D1, D2 and D3). Here is the correct setup of the DataFrames:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))

Bogumił Kamiński · Accepted Answer

Here is an alternative approach that assumes you want a left join, as in your question (if you need another type of join, it should be simple to adjust). As the name left_join_sorted suggests, it also relies on the key column being sorted in every data frame. The difference from Dan Getz's solution is that it does not use DataVector but operates on arrays allowing missing. You can check the difference by running showcols on the resulting DataFrame; the benefit is that such data will be more efficient to work with later, because their types are known:

function joiner(ref_left, ref_right, val_right)
    x = DataFrames.similar_missing(val_right, length(ref_left))
    j = 1
    for i in 1:length(ref_left)
        while ref_left[i] > ref_right[j]
            j += 1
            j > length(ref_right) && return x
        end
        if ref_left[i] == ref_right[j]
            x[i] = val_right[j]
        end
    end
    return x
end
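
To see what joiner does in isolation, here is a minimal sketch on plain vectors. It is an illustration rather than the code above: joiner_sketch is a hypothetical name, and it allocates the result as a Vector{Union{Missing, T}} instead of calling the internal DataFrames.similar_missing, so it runs without DataFrames. It assumes both key vectors are sorted in ascending order, and it adds a guard for an empty right-hand key vector.

```julia
# Hypothetical standalone variant of joiner.
# Assumption: ref_left and ref_right are both sorted in ascending order.
function joiner_sketch(ref_left::AbstractVector, ref_right::AbstractVector,
                       val_right::AbstractVector{T}) where {T}
    # Result column: one slot per left key, initially all missing.
    x = Vector{Union{Missing, T}}(missing, length(ref_left))
    # Nothing to match against, so every slot stays missing.
    isempty(ref_right) && return x
    j = 1
    for i in 1:length(ref_left)
        # Advance the right cursor while its key is smaller than the left key.
        while ref_left[i] > ref_right[j]
            j += 1
            # Right side exhausted: the remaining slots stay missing.
            j > length(ref_right) && return x
        end
        # Copy the value only when the keys match exactly.
        if ref_left[i] == ref_right[j]
            x[i] = val_right[j]
        end
    end
    return x
end

# Keys 2 and 5 have no partner on the right, so those slots stay missing.
col = joiner_sketch(1:5, [1, 3, 4], [10.0, 30.0, 40.0])
```

The element type of col is Union{Missing, Float64}, which is the "arrays allowing missing" representation mentioned above: the concrete value type stays visible to the compiler.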

function left_join_sorted(elems::Vector{DataFrame}, on::Symbol)
    # we perform left join to base_df
    # the columns of elems[1] will be reused, use deepcopy if you want fresh columns
    base_df = copy(elems[1])
    ref_left = base_df[on]
    for i in 2:length(elems)
        df = elems[i]
        ref_right = df[on]
        for n in names(df)
            if n != on
                # this assumes that column names in all data frames except on are unique, otherwise they will be overwritten
                # we perform left join to the first DataFrame in elems
                base_df[n] = joiner(ref_left, ref_right, df[n])
            end
        end
    end
    base_df
end

Here is an example of usage:

julia> left_join_sorted([df1, df2, df3], :REF)
5×4 DataFrames.DataFrame
│ Row │ REF │ D1       │ D2        │ D3       │
├─────┼─────┼──────────┼───────────┼──────────┤
│ 1   │ 1   │ 0.133361 │ 0.179822  │ 0.200842 │
│ 2   │ 2   │ 0.548581 │ 0.836018  │ 0.906814 │
│ 3   │ 3   │ 0.304062 │ 0.0797432 │ 0.946639 │
│ 4   │ 4   │ 0.755515 │ missing   │ 0.519437 │
│ 5   │ 5   │ 0.571302 │ missing   │ missing  │

As a side benefit, my benchmarks show that this is ~20x faster than using DataVector. If you want a further speedup, use @inbounds, but the benefits are probably not worth the risks.

EDIT: fixed condition in joiner loop.
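
A side note for current DataFrames.jl releases, where join(...; kind=:left) has been replaced by leftjoin and df[:REF] by df.REF: the loop from the question can be folded over the list with reduce. This is a minimal sketch, assuming DataFrames.jl 1.x is installed. It uses the generic join rather than the sorted-merge approach above, so it does not require sorted keys, and no claim is made here about its speed relative to left_join_sorted.

```julia
using DataFrames

df1 = DataFrame(REF = 1:5, D1 = rand(5))
df2 = DataFrame(REF = 1:3, D2 = rand(3))
df3 = DataFrame(REF = 1:4, D3 = rand(4))

# Fold the list with successive left joins onto the first data frame.
joined = reduce((left, right) -> leftjoin(left, right, on = :REF), [df1, df2, df3])

# Row order after a join may depend on the release, so sort by the key explicitly.
sort!(joined, :REF)
```

Rows whose :REF value is absent from df2 or df3 get missing in the D2 or D3 column, as in the output shown earlier.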
