Reputation: 6431
Given a Julia dataframe with many variables and a final class column:
julia> df
5×3 DataFrame
Row │ v1 v2 cl
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 10 1 2
2 │ 20 2 2
3 │ 300 10 1
4 │ 400 20 1
5 │ missing 30 1
I want to obtain a grouped df with summary statistics by class, with the classes by column and variables by rows, like:
julia> dfByCl
11×3 DataFrame
Row │ var cl_1 cl_2
│ String Float64? Float64?
─────┼───────────────────────────────
1 │ nrow 3.0 2.0
2 │ v1_mean 350.0 15.0
3 │ v1_std 70.7107 7.07107
4 │ v1_lb 252.002 5.20018
5 │ v1_ub 447.998 24.7998
6 │ v1_nm 2.0 2.0
7 │ v2_mean 20.0 1.5
8 │ v2_std 10.0 0.707107
9 │ v2_lb 8.68414 0.520018
10 │ v2_ub 31.3159 2.47998
11 │ v2_nm 3.0 2.0
and I don't want to explicitly name all the variables. Is there something simpler/more elegant than the above code ?
using Statistics, DataFrames, Distributions
meansk(data) = mean(skipmissing(data))
stdsk(data) = std(skipmissing(data))
nm(data) = sum(.! ismissing.(data))
ci(data::AbstractVector,α=0.05) = meansk(data) - quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data)), meansk(data) + quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]
df = DataFrame(v1=[10,20,300,400,missing],v2=[1,2,10,20,30],cl=[2,2,1,1,1])
dfByCl_w = combine(groupby(df,["cl"]),
nrow,
names(df) .=> meansk .=> col -> col * "_mean",
names(df) .=> stdsk .=> col -> col * "_std",
names(df) .=> cilb .=> col -> col * "_lb",
names(df) .=> ciub .=> col -> col * "_ub",
names(df) .=> nm .=> col -> col * "_nm",
)
orderedNames = vcat("cl","nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w = dfByCl_w[:, orderedNames]
toStack = vcat("nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l = stack(dfByCl_w,toStack)
dfByCl = unstack(dfByCl_l,"cl","value")
rename!(dfByCl,vcat("var",["cl_$(c)" for c in unique(dfByCl_w.cl)]))
Upvotes: 2
Views: 93
Reputation: 69819
Here is what I would normally do in such a case:
julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)
julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)
julia> combine(groupby(df, :cl),
nrow,
sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
Row │ cl nrow variable mean std lb ub nmissing
│ Int64 Int64 Symbol Float64 Float64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 1 3 v1 350.0 70.7107 252.002 447.998 1
2 │ 1 3 v2 20.0 10.0 8.68414 31.3159 0
3 │ 2 2 v1 15.0 7.07107 5.20018 24.7998 0
4 │ 2 2 v2 1.5 0.707107 0.520018 2.47998 0
Later you can reshape it as you like, but maybe this layout is something you would actually want?
Upvotes: 2