Melierax

Reputation: 71

How to apply a statistical test to each group of a dataframe in Julia? (tapply equivalent)

In Julia, I want to test the normality of a variable for each group defined in another column in a dataframe.

Let's say we have:

using DataFrames, Distributions
df = DataFrame(x = rand(Normal(), 30), group = repeat(["A", "B"], 15))

I know I can test the normality of x with:

using HypothesisTests
using Distributions
OneSampleADTest(df.x, Normal())

So the question is: how do I test the normality of x for each group? In R, I would use tapply(), but I couldn't find an equivalent in Julia.

Upvotes: 2

Views: 219

Answers (2)

Sundar R

Reputation: 14735

If you just want the p-value for each group in the data frame:

julia> combine(groupby(df, :group), :x => (x -> pvalue(OneSampleADTest(x, Normal()))) => :onesampleAD_pvalue)
2×2 DataFrame
 Row │ group   onesampleAD_pvalue 
     │ String  Float64            
─────┼────────────────────────────
   1 │ A                 0.275653
   2 │ B                 0.544317

If you want to print the test details (or do more complex manipulations) per group, you can loop over the groups instead:

julia> for (key, sdf) in pairs(groupby(df, :group))
         println("Group $(key.group)")
         display(OneSampleADTest(sdf.x, Normal()))
       end
Group A
One sample Anderson-Darling test
--------------------------------
...

Group B
One sample Anderson-Darling test
--------------------------------
...
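
If what you are after is closer to what tapply() returns in R (a lookup from group label to result), the same loop can be collapsed into a dictionary. This is only a minimal sketch: the name pvals and the fixed seed are assumptions added for illustration, not part of the question.

```julia
using DataFrames, Distributions, HypothesisTests, Random

Random.seed!(1)  # arbitrary seed, only so the sketch is reproducible
df = DataFrame(x = rand(Normal(), 30), group = repeat(["A", "B"], 15))

# Map each group label to the p-value of its Anderson-Darling test.
# pairs(groupby(...)) yields (GroupKey, SubDataFrame) pairs.
pvals = Dict(key.group => pvalue(OneSampleADTest(sdf.x, Normal()))
             for (key, sdf) in pairs(groupby(df, :group)))
```

After this, pvals["A"] and pvals["B"] hold the per-group p-values; the exact numbers depend on the random sample.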

Upvotes: 2

Bogumił Kamiński

Reputation: 69949

It depends on what output you expect. I recommend storing the result in a data frame (note that this is not what tapply does):

julia> gdf = groupby(df, :group, sort=true) # group by :group and keep groups sorted
GroupedDataFrame with 2 groups based on key: group
First Group (15 rows): group = "A"
 Row │ x          group
     │ Float64    String
─────┼───────────────────
   1 │ -0.869008  A
   2 │  0.190041  A
   3 │  0.369881  A
   4 │  0.445092  A
  ⋮  │     ⋮        ⋮
  13 │ -0.599266  A
  14 │  0.696132  A
  15 │  0.788465  A
           8 rows omitted
⋮
Last Group (15 rows): group = "B"
 Row │ x          group
     │ Float64    String
─────┼───────────────────
   1 │ -1.19973   B
   2 │  0.557241  B
   3 │ -0.425667  B
   4 │  0.787917  B
  ⋮  │     ⋮        ⋮
  13 │  1.96912   B
  14 │  0.567594  B
  15 │  1.39739   B
           8 rows omitted

julia> res = combine(gdf, :x => (x -> OneSampleADTest(x, Normal())) => :ADTest)
2×2 DataFrame
 Row │ group   ADTest
     │ String  OneSampl…
─────┼───────────────────────────────────────────
   1 │ A       One sample Anderson-Darling test…
   2 │ B       One sample Anderson-Darling test…

Now res holds both the group name and the result of the test (a full test-result object that you can work with later).
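
As a sketch of what "work with later" might look like, the stored test objects can be queried row by row with ByRow. The names res_with_p and first_test are assumptions introduced here for illustration; the data frame is rebuilt so that the snippet runs on its own.

```julia
using DataFrames, Distributions, HypothesisTests

df = DataFrame(x = rand(Normal(), 30), group = repeat(["A", "B"], 15))
gdf = groupby(df, :group, sort=true)
res = combine(gdf, :x => (x -> OneSampleADTest(x, Normal())) => :ADTest)

# Apply pvalue to each stored test object, one row at a time,
# keeping the original :ADTest column alongside the new one
res_with_p = transform(res, :ADTest => ByRow(pvalue) => :ADTest_pvalue)

# A single test object can also be pulled out and inspected directly
first_test = res.ADTest[1]
```

Keeping the full test objects in one column and deriving summaries from them avoids rerunning the test each time a different quantity is needed.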

If you are interested only in the p-value, do:

julia> res = combine(gdf, :x => (x -> pvalue(OneSampleADTest(x, Normal()))) => :ADTest_pvalue)
2×2 DataFrame
 Row │ group   ADTest_pvalue
     │ String  Float64
─────┼───────────────────────
   1 │ A            0.469626
   2 │ B            0.750134

If you are used to dplyr style, use DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @combine(gdf, :ADTest = OneSampleADTest(:x, Normal()))
2×2 DataFrame
 Row │ group   ADTest
     │ String  OneSampl…
─────┼───────────────────────────────────────────
   1 │ A       One sample Anderson-Darling test…
   2 │ B       One sample Anderson-Darling test…

julia> @combine(gdf, :ADTest_pvalue = pvalue(OneSampleADTest(:x, Normal())))
2×2 DataFrame
 Row │ group   ADTest_pvalue
     │ String  Float64
─────┼───────────────────────
   1 │ A            0.469626
   2 │ B            0.750134

Upvotes: 3
