Reputation: 11192
I have a df like below,
Sample Input:
4×2 DataFrame
│ Row │ col1 │ col2 │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ l1 │ 1 │
│ 2 │ l2 │ 2 │
│ 3 │ l1 │ 3 │
│ 4 │ l2 │ 4 │
I want to transform the above df to like below,
Expected df:
2×2 DataFrame
│ Row │ l1 │ l2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
I tried unstack approach unstack(df, "col1", "col2")
but I got ArgumentError: No key column found
. I understood that without key unstack is not possible. How to transform the above df to expected df?
Upvotes: 2
Views: 112
Reputation: 69819
DataFrames.jl requires you to specify keys for rows to allow to perform matching them by these keys:
julia> df = DataFrame(col1=["l1", "l2", "l1", "l2"], col2=1:4, rowkey=[1,1,2,2])
4×3 DataFrame
│ Row │ col1 │ col2 │ rowkey │
│ │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼────────┤
│ 1 │ l1 │ 1 │ 1 │
│ 2 │ l2 │ 2 │ 1 │
│ 3 │ l1 │ 3 │ 2 │
│ 4 │ l2 │ 4 │ 2 │
julia> unstack(df, "col1", "col2")
2×3 DataFrame
│ Row │ rowkey │ l1 │ l2 │
│ │ Int64 │ Int64? │ Int64? │
├─────┼────────┼────────┼────────┤
│ 1 │ 1 │ 1 │ 2 │
│ 2 │ 2 │ 3 │ 4 │
Why? Imagine your data looks like this:
julia> df = DataFrame(col1=["l1", "l2", "l1", "l2","l1"], col2=1:5, rowkey=[1,1,3,3,2])
5×3 DataFrame
│ Row │ col1 │ col2 │ rowkey │
│ │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼────────┤
│ 1 │ l1 │ 1 │ 1 │
│ 2 │ l2 │ 2 │ 1 │
│ 3 │ l1 │ 3 │ 3 │
│ 4 │ l2 │ 4 │ 3 │
│ 5 │ l1 │ 5 │ 2 │
julia> unstack(df, "col1", "col2")
3×3 DataFrame
│ Row │ rowkey │ l1 │ l2 │
│ │ Int64 │ Int64? │ Int64? │
├─────┼────────┼────────┼─────────┤
│ 1 │ 1 │ 1 │ 2 │
│ 2 │ 2 │ 5 │ missing │
│ 3 │ 3 │ 3 │ 4 │
without :rowkey
it would be not possible to tell that you actually want the second row in :l2
to hold a missing value.
This is how unstack
works as it is a general function, so it uses row keys to perform matching. If you want something that assumes:
write this:
julia> DataFrame([first(sdf.col1) => sdf.col2 for sdf in groupby(df, :col1)])
2×2 DataFrame
│ Row │ l1 │ l2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
or a bit more general pattern
julia> DataFrame([key.col1 => sdf.col2 for (key, sdf) in pairs(groupby(df, :col1))])
2×2 DataFrame
│ Row │ l1 │ l2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
Upvotes: 2