Reputation: 909

How to find mean value of rows of a data frame based on values in a column in Julia?

I have the following data frame in Julia.

using DataFrames 
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"])

5×2 DataFrame
│ Row │ Value │ Type   │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 23    │ A      │
│ 2   │ 56    │ B      │
│ 3   │ 10    │ A      │
│ 4   │ 48    │ B      │
│ 5   │ 51    │ B      │

How to get the mean of the column Value based on the column Type?

Upvotes: 3

Answers (3)

Bogumił Kamiński

Reputation: 69819

If you want performance consider the following options

julia> using DataFrames

julia> using Statistics

julia> using BenchmarkTools

julia> data = DataFrame(Value = rand(1:10, 10^6),
                        Type = categorical(rand(["A", "B"], 10^6)));

Note that I generate :Type column as categorical, as this will be much faster to aggregate later.

First a timing from the answer above:

julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  212
  --------------
  minimum time:     12.173 ms (0.00% GC)
  median time:      13.305 ms (3.63% GC)
  mean time:        14.229 ms (4.30% GC)
  maximum time:     20.491 ms (2.98% GC)
  --------------
  samples:          352
  evals/sample:     1

Here is a timing where I change df[:, :Value] to df.Value. The difference is that df.Value does not copy data unnecessarily. You can see that already you save over 10% of run time:

julia> @benchmark by($data, :Type, df -> mean(df.Value))
BenchmarkTools.Trial: 
  memory estimate:  22.90 MiB
  allocs estimate:  203
  --------------
  minimum time:     10.926 ms (0.00% GC)
  median time:      13.151 ms (1.92% GC)
  mean time:        13.093 ms (3.53% GC)
  maximum time:     16.933 ms (3.25% GC)
  --------------
  samples:          382
  evals/sample:     1

And here is an efficient way to write it. This statement means that we pass column :Value to a function mean:

julia> @benchmark by($data, :Type, :Value => mean)
BenchmarkTools.Trial: 
  memory estimate:  15.27 MiB
  allocs estimate:  190
  --------------
  minimum time:     8.326 ms (0.00% GC)
  median time:      8.667 ms (0.00% GC)
  mean time:        9.599 ms (2.74% GC)
  maximum time:     17.364 ms (3.57% GC)
  --------------
  samples:          521
  evals/sample:     1

To finalize let us just check the difference if :Value is a Vector{String} (the approach that is given in the other answer):

julia> data.Type = String.(data.Type);

julia> @benchmark by($data, [:Type], df -> mean(df[:, :Value]))
BenchmarkTools.Trial: 
  memory estimate:  46.16 MiB
  allocs estimate:  197
  --------------
  minimum time:     26.664 ms (2.08% GC)
  median time:      27.197 ms (2.11% GC)
  mean time:        27.486 ms (2.11% GC)
  maximum time:     35.740 ms (1.64% GC)
  --------------
  samples:          182
  evals/sample:     1

And you can see that it is around three times slower than the recommended answer. Also note that:

julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type   │ Value_mean │
│     │ String │ Float64    │
├─────┼────────┼────────────┤
│ 1   │ B      │ 5.50175    │
│ 2   │ A      │ 5.49524    │

produces a nicer default name for the generated column (as it knows the source column name and the transformation function name).

Upvotes: 5

Nils Gudat

Reputation: 13800

A more succinct way of writing this would be:

julia> by(data, :Type, :Value => mean)
2×2 DataFrame
│ Row │ Type   │ Value_mean │
│     │ String │ Float64    │
├─────┼────────┼────────────┤
│ 1   │ A      │ 16.5       │
│ 2   │ B      │ 51.6667    │

Note that if grouping by one variable, it is not necessary to pass an array as the second argument. Also, the function(s) to apply can be passed directly as a Pair of column name and function to apply. This can be extended to additional functions as well:

julia> by(data, :Type, :Value => mean, :Value => median)
2×3 DataFrame
│ Row │ Type   │ Value_mean │ Value_median │
│     │ String │ Float64    │ Float64      │
├─────┼────────┼────────────┼──────────────┤
│ 1   │ A      │ 16.5       │ 16.5         │
│ 2   │ B      │ 51.6667    │ 51.0         │

This will create new columns which automatically append the function name to the column that is being grouped. These default names can be overwritten by passing a new column name like so:

julia> by(data, :Type, my_new_column = :Value => mean)
2×2 DataFrame
│ Row │ Type   │ my_new_column │
│     │ String │ Float64       │
├─────┼────────┼───────────────┤
│ 1   │ A      │ 16.5          │
│ 2   │ B      │ 51.6667       │

Upvotes: 4

Qwerty

Reputation: 909

Using the function by() will group the rows of the column and then applying the function mean() (from Statistics) will result in a single value for each Type: the mean of the Type A and of the Type B.

using DataFrames
using Statistics
data = DataFrame(Value = [23, 56, 10, 48, 51], Type = ["A", "B", "A", "B", "B"]);
by(data, [:Type], df -> mean(df[:, :Value]))

2×2 DataFrame
│ Row │ Type   │ x1      │
│     │ String │ Float64 │
├─────┼────────┼─────────┤
│ 1   │ A      │ 16.5    │
│ 2   │ B      │ 51.6667 │

More information on data frames in Julia can be found here: https://juliadata.github.io/DataFrames.jl/stable/man/getting_started/

Upvotes: 1

How to find mean value of rows of a data frame based on values in a column in Julia?

Answers (3)

Related Questions