RmyjuloR
RmyjuloR

Reputation: 438

R: Kruskal-Wallis test in loop over specified columns in data frame

I would like to run a KW-test over certain numerical variables from a data.frame, using one grouping variable. I'd prefer to do this in a loop, instead of typing out all the tests, as they are many variables (more than in the example below).

Simulated data:

library(dplyr)
set.seed(123)
Data <- tbl_df(
data.frame(
muttype = as.factor(rep(c("missense", "frameshift", "nonsense"), each = 80)),
ados.tsc   = runif(240, 0, 10),
ados.sa    = runif(240, 0, 10),
ados.rrb   = runif(240, 0, 10))
) %>%
group_by(muttype)
ados.sim <- as.data.frame(Data)

The following code works just fine outside of the loop.

kruskal.test(formula (paste((colnames(ados.sim)[2]), "~ muttype")), data = 
ados.sim)

But it doesn't inside the loop:

for(i in names(ados.sim[,2:4])){  
ados.mtp <- kruskal.test(formula (paste((colnames(ados.sim)[i]), "~ muttype")), 
data = ados.sim)
}

I get the error:

Error in terms.formula(formula, data = data) : invalid term in model formula

Anybody who knows how to solve this?

Upvotes: 3

Views: 5231

Answers (2)

Karolis Koncevičius
Karolis Koncevičius

Reputation: 9658

If you can use external libraries:

library(matrixTests)

col_kruskalwallis(ados.sim[,-1], ados.sim[,1])

         obs.tot obs.groups df statistic    pvalue
ados.tsc     240          3  2 0.3618102 0.8345146
ados.sa      240          3  2 0.1319554 0.9361517
ados.rrb     240          3  2 2.2205031 0.3294761

Upvotes: 0

coffeinjunky
coffeinjunky

Reputation: 11514

Try:

results <- list()
for(i in names(ados.sim[,2:4])){  
  results[[i]] <- kruskal.test(formula(paste(i, "~ muttype")), data = ados.sim)
}

This also saves your results in a list and avoids overwriting your results as ados.mtp in every iteration, which I think is not what you intended to do.

Note the following:

for(i in names(ados.sim[,2:4])){  
   print(i)
}
[1] "ados.tsc"
[1] "ados.sa"
[1] "ados.rrb"

That is, i already gives you the name of the column. The problem in your code was that you tried to use it like an integer for subsetting, which turned the outcome into NA.

for(i in names(ados.sim[,2:4])){  
   print(paste((colnames(ados.sim)[i]), "~ muttype"))
}
[1] "NA ~ muttype"
[1] "NA ~ muttype"
[1] "NA ~ muttype"

And just for reference, all of this could also be done in the following two ways that I often prefer since it makes subsequent analysis slightly easier:

First, store all test objects in a dataframe:

library(tidyr)
df <- ados.sim %>% gather(key, value, -muttype) %>% 
      group_by(key) %>% 
      do(test = kruskal.test(x= .$value, g = .$muttype))

You can then subset the dataframe to get the test outcomes:

df[df$key == "ados.rrb",]$test
[[1]]

    Kruskal-Wallis rank sum test

data:  .$value and .$muttype
Kruskal-Wallis chi-squared = 2.2205, df = 2, p-value = 0.3295

Alternatively, get all results directly in a dataframe, without storing the test objects:

library(broom)
df2 <- ados.sim %>% gather(key, value, -muttype) %>% 
       group_by(key) %>% 
       do(tidy(kruskal.test(x= .$value, g = .$muttype)))
df2
# A tibble: 3 x 5
# Groups:   key [3]
       key statistic   p.value parameter                       method
     <chr>     <dbl>     <dbl>     <int>                       <fctr>
1 ados.rrb 2.2205031 0.3294761         2 Kruskal-Wallis rank sum test
2  ados.sa 0.1319554 0.9361517         2 Kruskal-Wallis rank sum test
3 ados.tsc 0.3618102 0.8345146         2 Kruskal-Wallis rank sum test

Upvotes: 5

Related Questions