Reputation: 4102
I have a data set of values (val) grouped by multiple categories (distance & phase). I would like to test each category with a Kruskal-Wallis test, where val is the dependent variable, distance is a factor, and phase splits my data into 3 groups.
As such, I need to specify the subset of the data within the Kruskal-Wallis test and then apply the test to each of the groups. But I cannot get my subsetting to work!
In the R help, subset is described as an optional vector specifying a subset of observations to be used.
But how do I correctly pass this in my lapply function?
My dummy data:
# create data
val<-runif(60, min = 0, max = 100)
distance<-floor(runif(60, min=1, max=3))
phase<-rep(c("a", "b", "c"), 20)
df<-data.frame(val, distance, phase)
# get unique groups
ii<-unique(df$phase)
# get basic statistics per group
aggregate(val ~ distance + phase, df, mean)
# run Kruskal test, specify the subset
kruskal.test(df$val ~df$distance,
subset = phase == "c")
This works well, so my subset should be correctly set as a vector.
But how do I use this in an lapply function?
# DOES not work!!
lapply(ii, kruskal.test(df$val ~ df$distance,
subset = df$phase == as.character(ii)))
My overall goal is to create a function from kruskal.test and save all statistics for each group into one table.
All help is highly appreciated.
Upvotes: 2
Views: 1798
Reputation: 8200
Though it is late, this might help someone with the same problem, so I am adding an answer that uses the tidyverse and rstatix packages. The rstatix package "provides a simple and intuitive pipe friendly framework, coherent with the 'tidyverse' design philosophy for performing basic statistical tests".
library(rstatix)
library(tidyverse)
df %>%
group_by(phase) %>%
kruskal_test(val ~ distance)
Output
# A tibble: 3 x 7
  phase .y.       n statistic    df     p method
* <chr> <chr> <int>     <dbl> <int> <dbl> <chr>
1 a     val      20    0.230      1 0.631 Kruskal-Wallis
2 b     val      20    0.0229     1 0.88  Kruskal-Wallis
3 c     val      20    0.322      1 0.570 Kruskal-Wallis
This is the same as the result provided by @user295691.
Data
df = structure(list(val = c(93.8056977232918, 31.0681172646582, 40.5262873973697,
47.6368983509019, 65.23181500379, 64.4571609096602, 10.3301600087434,
90.4661140637472, 41.2359046051279, 28.3357713604346, 49.8977075796574,
10.8744730940089, 5.31001624185592, 71.9248640118167, 99.0267782937735,
73.7928744405508, 3.31214582547545, 40.2693636715412, 27.6980920461938,
79.501334275119, 60.5167196830735, 89.9171086261049, 87.4633299885318,
43.1893823202699, 91.1248738644645, 99.755659350194, 7.25280269980431,
96.957387868315, 75.0860505970195, 52.3794749286026, 26.6221587313339,
52.5518182432279, 24.1361060412601, 49.5364486705512, 65.5214034719393,
38.9469220302999, 0.687191751785576, 19.3090825574473, 19.6511475136504,
25.5966754630208, 7.33999472577125, 33.9820940745994, 50.3751677693799,
10.811762069352, 17.2359711956233, 53.958406439051, 64.2723652534187,
92.7404976682737, 26.824192632921, 30.0975760444999, 52.0105463219807,
74.4495407678187, 56.0636054025963, 91.891074879095, 14.0827904455364,
59.3607738381252, 66.5170294465497, 24.1726311156526, 83.0881901318207,
35.5380675755441), distance = c(2, 1, 1, 1, 1, 2, 1, 2, 2, 1,
2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1,
1, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,
1, 2, 1, 1, 2, 2, 2, 2), phase = c("a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b",
"c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c",
"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c", "a",
"b", "c")), class = "data.frame", row.names = c(NA, -60L))
Upvotes: 2
Reputation: 7248
Usually you would start by splitting and then lapplying.
Something like
lapply(split(df, df$phase), function(d) { kruskal.test(val ~ distance, data=d) })
would yield a list, indexed by phase, of the results of kruskal.test.
Your final expression does not work because lapply expects a function as its second argument, and calling kruskal.test does not return a function; it returns the result of running the test. If you wrap it in a function definition that takes the index, it will work, though it is a little less idiomatic.
lapply(ii, function(i) { kruskal.test(df$val ~ df$distance, subset=df$phase==i )})
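To get from that list to the single table mentioned in the question, one possible approach is to pull the statistic, degrees of freedom and p-value out of each result and row-bind them. This is only a minimal sketch: the names tests and res, the chosen columns, and the set.seed call are my own additions for illustration and are not part of the original data or code.

```r
# Rebuild the question's dummy data (the seed is an assumption, added only
# so the sketch is reproducible)
set.seed(1)
df <- data.frame(
  val      = runif(60, min = 0, max = 100),
  distance = floor(runif(60, min = 1, max = 3)),
  phase    = rep(c("a", "b", "c"), 20)
)

# One kruskal.test result (an object of class "htest") per phase
tests <- lapply(split(df, df$phase), function(d) {
  kruskal.test(val ~ distance, data = d)
})

# Extract the pieces of interest from each result and stack the
# one-row data frames into a single table
res <- do.call(rbind, lapply(names(tests), function(p) {
  kt <- tests[[p]]
  data.frame(
    phase     = p,
    statistic = unname(kt$statistic),  # Kruskal-Wallis chi-squared
    df        = unname(kt$parameter),  # degrees of freedom
    p.value   = unname(kt$p.value)
  )
}))
res
```

The result should have one row per phase. Other elements of the test object, such as kt$method, could be added as further columns in the same way.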
Upvotes: 4