Julia: Collapsing DataFrame by multiple values retaining additional variables

Question

I have some data in which rows are duplicated except for a single field, which I would like to join. Within each day and company, everything but the report should stay the same. Companies can file multiple reports on the same day.

I can join the reports using the following code, but I lose the variables that are not listed in my call to by. Any suggestions?

Mock Data

using DataFrames

# Number of observations
n = 100
words = split("the widget drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
  report = [join(rand(words, rand(1:5)), " ") for i in 1:n])

x = df[:, [:day, :company]]

# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
    # Create the column and fill it with empty strings (broadcast assignment)
    df[!, Symbol("v" * string(i))] .= ""
end

# Give every day/company group the same random 3-letter value in each v column
for i in 1:size(x, 1), j in 1:nv
    df[(df.day .== x[i, 1]) .& (df.company .== x[i, 2]), Symbol("v" * string(j))] .=
        join(rand('a':'z', 3), "")
end

Collapsed data

outdf = by(df, [:company, :day]) do sub
  t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end
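
One possible way to keep the extra variables, sketched under the assumption of a recent DataFrames.jl (1.x), where combine(groupby(...)) takes the place of by. Since the other columns are stated to be identical within each day/company group, taking the first value of each is enough. The small data frame and the name constcols are hypothetical and only serve the illustration.

```julia
using DataFrames

# Hypothetical miniature of the mock data: v1 is constant within each (company, day) group.
df = DataFrame(
    day     = [0, 0, 1, 1],
    company = [1, 1, 1, 2],
    report  = ["fox tool", "flap", "drop air", "freeze"],
    v1      = ["abc", "abc", "xyz", "qrs"],
)

# Columns assumed constant per group: everything except the keys and the report.
constcols = setdiff(names(df), ["day", "company", "report"])

outdf = combine(
    groupby(df, [:company, :day]),
    # Join all reports of a group into one string.
    :report => (r -> join(r, "\n(Joined)\n")) => :fullreport,
    # Keep the first value of every constant column under its original name.
    constcols .=> first .=> constcols,
)
```

If the columns really are constant within groups, grouping on every column except the report, for example groupby(df, Not(:report)), should give the same rows without listing the columns explicitly, although grouping on 100 string columns may be slower.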

