Reputation: 4055
I have some data that has duplicate fields with the exception of a single field which I would like to join. In the data everything but the report
should stay the same on each day and each company. Companies can file multiple reports on the same day.
I can join using the following code but I am losing the variables which are not in my by
function. Any suggestions?
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
df[:, Symbol("v" * string(i))] = ""
end
for i in 1:size(x, 1),j in 1:nv
df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v" * string(j))] =
join(rand('a':'z', 3), "")
end
outdf = by(df, [:company, :day]) do sub
t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end
Upvotes: 2
Views: 224
Reputation: 69869
Here are some minor tweaks in your data preparation code:
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
df[:, Symbol("v", i)] .= ""
end
for i in 1:size(x, 1), j in 1:nv
df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v", j)] .= join(rand('a':'z', 3), "")
end
and here is by
that keeps all other variables (assuming they are constant per group, this code should be efficient even for relatively large data):
outdf = by(df, [:company, :day]) do sub
merge((fullreport = join(sub.report, "\n(Joined)\n"),),
copy(sub[1, Not([:company, :day, :report])]))
end
I put the fullreport
variable as a first one.
Here is the code that would keep all rows from the original data frame:
outdf = by(df, [:company, :day]) do sub
insertcols!(select(sub, Not([:company, :day, :report])), 1,
fullreport = join(sub.report, "\n(Joined)\n"))
end
and now you can e.g. check that unique(outdf)
produces the same data frame as the one that was generated by the fist by
.
(in the codes above I dropped also :report
variable as I guess you did not want it in the result - right?)
Upvotes: 1