Reputation: 909
I have the following data frame in Julia.
using DataFrames
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])
5×2 DataFrame
│ Row │ Value │ Type │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 23 │ A │
│ 2 │ 56 │ B │
│ 3 │ 10 │ A │
│ 4 │ 48 │ B │
│ 5 │ 51 │ B │
How to get the mean of the column Value based on the column Type?
Upvotes: 3
Views: 1898
Reputation: 69819
If you want performance consider the following options
julia> using DataFrames
julia> using Statistics
julia> using BenchmarkTools
julia> data = DataFrame(Value = rand(1:10, 10^6),
Type = categorical(rand(["A", "B"], 10^6)));
Note that I generate :Type
column as categorical, as this will be much faster to aggregate later.
First a timing from the answer above:
julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial:
memory estimate: 30.53 MiB
allocs estimate: 212
--------------
minimum time: 12.173 ms (0.00% GC)
median time: 13.305 ms (3.63% GC)
mean time: 14.229 ms (4.30% GC)
maximum time: 20.491 ms (2.98% GC)
--------------
samples: 352
evals/sample: 1
Here is a timing where I change df[:, :Value]
to df.Value
. The difference is that df.Value
does not copy data unnecessarily. You can see that already you save over 10% of run time:
julia> @benchmark by($data, :Type, df -> mean(df.Value))
BenchmarkTools.Trial:
memory estimate: 22.90 MiB
allocs estimate: 203
--------------
minimum time: 10.926 ms (0.00% GC)
median time: 13.151 ms (1.92% GC)
mean time: 13.093 ms (3.53% GC)
maximum time: 16.933 ms (3.25% GC)
--------------
samples: 382
evals/sample: 1
And here is an efficient way to write it. This statement means that we pass column :Value
to a function mean
:
julia> @benchmark by($data, :Type, :Value => mean)
BenchmarkTools.Trial:
memory estimate: 15.27 MiB
allocs estimate: 190
--------------
minimum time: 8.326 ms (0.00% GC)
median time: 8.667 ms (0.00% GC)
mean time: 9.599 ms (2.74% GC)
maximum time: 17.364 ms (3.57% GC)
--------------
samples: 521
evals/sample: 1
To finalize let us just check the difference if :Value
is a Vector{String}
(the approach that is given in the other answer):
julia> data.Type = String.(data.Type);
julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial:
memory estimate: 46.16 MiB
allocs estimate: 197
--------------
minimum time: 26.664 ms (2.08% GC)
median time: 27.197 ms (2.11% GC)
mean time: 27.486 ms (2.11% GC)
maximum time: 35.740 ms (1.64% GC)
--------------
samples: 182
evals/sample: 1
And you can see that it is around three times slower than the recommended answer. Also note that:
julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type │ Value_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ B │ 5.50175 │
│ 2 │ A │ 5.49524 │
produces a nicer default name for the generated column (as it knows the source column name and the transformation function name).
Upvotes: 5
Reputation: 13800
A more succinct way of writing this would be:
julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type │ Value_mean │
│ │ String │ Float64 │
├─────┼────────┼────────────┤
│ 1 │ A │ 16.5 │
│ 2 │ B │ 51.6667 │
Note that if grouping by one variable, it is not necessary to pass an array as the second argument. Also, the function(s) to apply can be passed directly as a Pair
of column name and function to apply. This can be extended to additional functions as well:
julia> by(data, :Type, :Value => mean, :Value => median)
2×3 DataFrame
│ Row │ Type │ Value_mean │ Value_median │
│ │ String │ Float64 │ Float64 │
├─────┼────────┼────────────┼──────────────┤
│ 1 │ A │ 16.5 │ 16.5 │
│ 2 │ B │ 51.6667 │ 51.0 │
This will create new columns which automatically append the function name to the column that is being grouped. These default names can be overwritten by passing a new column name like so:
julia> by(data, :Type, my_new_column = :Value => mean)
2×2 DataFrame
│ Row │ Type │ my_new_column │
│ │ String │ Float64 │
├─────┼────────┼───────────────┤
│ 1 │ A │ 16.5 │
│ 2 │ B │ 51.6667 │
Upvotes: 4
Reputation: 909
Using the function by()
will group the rows of the column and then applying the function mean()
(from Statistics
) will result in a single value for each Type: the mean of the Type A and of the Type B.
using DataFrames
using Statistics
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"]);
by(data, [:Type], df -> mean(df[:, :Value]))
2×2 DataFrame
│ Row │ Type │ x1 │
│ │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1 │ A │ 16.5 │
│ 2 │ B │ 51.6667 │
More information on data frames in Julia can be found here: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started/
Upvotes: 1