Reputation: 233
Suppose I have two DataFrames df1
and df2
as follows
df1 = DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 32
3 │ a 3
4 │ b 22
5 │ b 5
6 │ b 4
7 │ c 6
8 │ c 7
9 │ c 8
10 │ d 4
11 │ d 3
df2 = DataFrame(id=["a", "a", "b", "b", "b", "c", "c", "c"],
var=[1, 1, 2, 2, 2, 6, 6, 6])
8×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ b 2
4 │ b 2
5 │ b 2
6 │ c 6
7 │ c 6
8 │ c 6
The objective is to replace the var
column in df1 for each id
with the value of var
from df2
for each corresponding id
, only for those id
which exist in both df2
and df1
So the desired outcome will look like:
DataFrame(id=["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d"],
var=[1, 32, 3, 22, 5, 4, 6, 7, 8, 4, 3])
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3
Tried the following but they don't work
for d1 in groupby(df1, :id)
replace!(d1.var .= [d2.var for d1 in groupby(df2, :id)])
end
#or
[[d1.var = d2.var for d2 in groupby(df2, :id)] for d1 in groupby(df1, :id)]
Will appreciate any help. Thanks!
Upvotes: 2
Views: 559
Reputation: 69949
I would do it like this:
julia> leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
11×3 DataFrame
Row │ id var var_1
│ String Int64 Int64?
─────┼────────────────────────
1 │ a 1 1
2 │ a 32 1
3 │ a 3 1
4 │ b 22 2
5 │ b 5 2
6 │ b 4 2
7 │ c 6 6
8 │ c 7 6
9 │ c 8 6
10 │ d 4 missing
11 │ d 3 missing
julia> select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3
Note that the problem with your data is that df2
has multiple rows for the same unique :id
so I run unique
on it before joining (also note that number of values per group in df1
and df2
is not the same).
There are other ways to do it (please comment if you would like to see them), but they would involve more code (iteration through groups etc.). The solution I propose relies on the functions that are part of DataFrames.jl API.
Performance comparison (first run includes compilation time):
julia> repeat!(df1, 10^6);
julia> function sol1(df1, df2)
leftjoin!(df1, unique(df2, :id), on=:id, makeunique=true)
select!(df1, :id, [:var_1, :var] => ByRow(coalesce) => :var)
end
sol1 (generic function with 1 method)
julia> function sol2(df1, df2)
D = Dict(df2[!, :id] .=> df2[!, :var])
for (i, v) in enumerate(df1[!, :id])
if v in keys(D)
df1[!, :var][i] = D[v]
end
end
end
sol2 (generic function with 1 method)
julia> function sol3(var1, var2, id1, id2)
D = Dict(id2 .=> var2)
for (i, v) in enumerate(id1)
if v in keys(D)
var = D[v]
end
end
end
sol3 (generic function with 3 methods)
julia> x = copy(df1);
julia> @time sol1(x, df2);
3.103564 seconds (4.97 M allocations: 685.190 MiB, 4.84% gc time, 80.37% compilation time)
julia> x = copy(df1);
julia> @time sol1(x, df2);
0.660479 seconds (585 allocations: 409.727 MiB, 23.24% gc time)
julia> x = copy(df1);
julia> @time sol2(x, df2);
3.462888 seconds (55.36 M allocations: 1.495 GiB, 6.22% gc time, 3.58% compilation time)
julia> x = copy(df1);
julia> @time sol2(x, df2);
3.382529 seconds (55.00 M allocations: 1.475 GiB, 7.24% gc time)
julia> x = copy(df1);
julia> @time sol3(x.var, df2.var, x.id, df2.id);
0.380297 seconds (142.77 k allocations: 7.360 MiB, 9.58% compilation time)
julia> x = copy(df1);
julia> @time sol3(x.var, df2.var, x.id, df2.id);
0.331760 seconds (5 allocations: 720 bytes)
Upvotes: 3
Reputation: 5559
I believe @BogumiłKamiński answer is the way to go if you decide to use what DataFrames.jl API has to offer. If you want to try something longer but way faster, however, you might want to try the following:
D = Dict(df2[!, :id] .=> df2[!, :var])
for (i, v) in enumerate(df1[!, :id])
if v in keys(D)
df1[!, :var][i] = D[v]
end
end
Julia > df1
11×2 DataFrame
Row │ id var
│ String Int64
─────┼───────────────
1 │ a 1
2 │ a 1
3 │ a 1
4 │ b 2
5 │ b 2
6 │ b 2
7 │ c 6
8 │ c 6
9 │ c 6
10 │ d 4
11 │ d 3
Upvotes: 2