Reputation: 2405
I am trying to speed up the loop, in which consecutive dataframes are joined with the first one with the first column as key. Dataframes are produced by a function my_function
. First column is named :REF
. Consecutive Dataframes could be shorter that first one, thus I cannot directly assign to DF column, as I would do in pandas.
base_df = my_function(elem1)
for elem in elems[2:end]
tmp = my_function(elem)
base_df = join(base_df, tmp, on=:REF, kind=:left)
end
Is there any way to join list of dataframes into one? Thanks,
PS: DataFrames are of different types: String, Int, Float64.
Upd. So, example DataFrames:
df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))
What I looking for it to combine those three (or more) into single DataFrame at once. Note the row count differencies.
Upd2. Sorry, it should have been diffent columns on df1, df2 and df3 (D1, D2 and D3). Here is the correct setup of DFs
df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))
Upvotes: 2
Views: 1533
Reputation: 69949
Here is an alternative approach that assumes you want a left join (as in your question - if you need another type of join it should be simple to adjust it). The difference from Dan Getz solution is that it does not use DataVector
but operates on arrays allowing missing
(you can check the difference by running showcols
on resulting DataFrame
; the benefit is that it will be more efficient to work with such data later as we will know their types):
function joiner(ref_left, ref_right, val_right)
x = DataFrames.similar_missing(val_right, length(ref_left))
j = 1
for i in 1:length(ref_left)
while ref_left[i] > ref_right[j]
j += 1
j > length(ref_right) && return x
end
if ref_left[i] == ref_right[j]
x[i] = val_right[j]
end
end
return x
end
function left_join_sorted(elems::Vector{DataFrame}, on::Symbol)
# we perform left join to base_df
# the columns of elems[1] will be reused, use deepcopy if you want fresh columns
base_df = copy(elems[1])
ref_left = base_df[:REF]
for i in 2:length(elems)
df = elems[i]
ref_right = df[:REF]
for n in names(df)
if n != on
# this assumes that column names in all data frames except on are unique, otherwise they will be overwritten
# we perform left join to the first DataFrame in elems
base_df[n] = joiner(ref_left, ref_right, df[n])
end
end
end
base_df
end
Here is an example of usage:
julia> left_join_sorted([df1, df2, df3], :REF)
5×4 DataFrames.DataFrame
│ Row │ REF │ D1 │ D2 │ D3 │
├─────┼─────┼──────────┼───────────┼──────────┤
│ 1 │ 1 │ 0.133361 │ 0.179822 │ 0.200842 │
│ 2 │ 2 │ 0.548581 │ 0.836018 │ 0.906814 │
│ 3 │ 3 │ 0.304062 │ 0.0797432 │ 0.946639 │
│ 4 │ 4 │ 0.755515 │ missing │ 0.519437 │
│ 5 │ 5 │ 0.571302 │ missing │ missing │
As a side benefit my benchmarks show that this is ~20x faster than using DataVector
(if you want a further speedup use @inbounds
but probably the benefits are not worth the risks).
EDIT: fixed condition in joiner
loop.
Upvotes: 2
Reputation: 18227
Just to setup the answer, suppose:
df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))
elems = [df1, df2, df3]
my_function = identity
Now the code to generate the big DataFrame:
dfs = my_function.(elems)
base_df = DataFrame(Dict([f=>vcat(getindex.(dfs,f)...) for f in names(dfs[1])]...))
Giving something like:
12×2 DataFrames.DataFrame
│ Row │ D1 │ REF │
├─────┼────────────┼─────┤
│ 1 │ 0.664144 │ 1 │
│ 2 │ 0.119155 │ 2 │
│ 3 │ 0.471053 │ 3 │
│ 4 │ 0.547811 │ 4 │
│ 5 │ 0.600263 │ 5 │
│ 6 │ 0.21306 │ 1 │
│ 7 │ 0.985412 │ 2 │
│ 8 │ 0.886738 │ 3 │
│ 9 │ 0.00926173 │ 1 │
│ 10 │ 0.701962 │ 2 │
│ 11 │ 0.328322 │ 3 │
│ 12 │ 0.753062 │ 4 │
This approach reduces the memory used from quadratic-ish to linear-ish (and performance improves inline with memory reduction)
UPDATE
As new details came to light (and my understanding of the question improved), here is code to better generate the desired base_df
:
df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))
elems = [df1, df2, df3]
cols = [(i,f) for (i,t) in enumerate(elems) for f in names(t) if !(f == :REF)]
rows = union(getindex.(elems,:REF)...)
ref2row = Dict(v=>i for (i,v) in enumerate(rows))
pre_df = Dict{Symbol,DataVector{Any}}([c[2]=>DataArray(eltype(elems[c[1]][c[2]]),
length(rows)) for c in cols])
foreach(tpl -> pre_df[tpl[3][1]][ref2row[tpl[2]]] = tpl[3][2],
[(i,r[:REF],v)
for (i,t) in enumerate(elems)
for r in eachrow(t)
for v in r if v[1] != :REF
])
pre_df[:REF] = [ref2row[i] for i=1:length(rows)]
base_df = DataFrame(pre_df)
Giving:
5×4 DataFrames.DataFrame
│ Row │ D1 │ D2 │ D3 │ REF │
├─────┼──────────┼──────────┼───────────┼─────┤
│ 1 │ 0.93479 │ 0.582954 │ 0.133983 │ 1 │
│ 2 │ 0.472456 │ 0.992173 │ 0.32442 │ 2 │
│ 3 │ 0.365478 │ 0.117772 │ 0.62522 │ 3 │
│ 4 │ 0.976192 │ NA │ 0.0861988 │ 4 │
│ 5 │ 0.76358 │ NA │ NA │ 5 │
Upvotes: 2