Alex Nesta

Reputation: 413

grouped statistical test tidyverse

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.

The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.

Below is an example that reshapes the iris dataset to mimic my data, followed by my attempts at the task.

I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.

library(tidyverse)

# Reformatting iris to mimic my data.
long_format <- iris %>% 
  gather(key = "attribute", value = "measurement", -Species) %>%
  mutate(descriptor = 
           case_when(
    str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
    str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
  mutate(Feature = 
           case_when(
    str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
    str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))

# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)

# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
  group_by(Species, Feature) %>%
  do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>% 
  mutate(Wilcox = w$p.value)

# Attempt with dplyr's experimental group_map() function. It returns only the p-values; I would like them appended to the original df as a new column in one step.

cleaned_up %>%
  group_by(Species, Feature) %>%
  group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)

Thanks for your help.

Upvotes: 1

Views: 455

Answers (2)

IceCreamToucan

Reputation: 28675

Another option is to omit the data argument entirely. wilcox.test() only needs a data argument when the variables in the formula aren't visible in the calling scope, and expressions evaluated inside mutate() have all of the data frame's columns in scope (restricted to the current group when the data frame is grouped).

cleaned_up %>%
  group_by(Species, Feature) %>%
  mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)

This gives the same result as akrun's output (thanks to his correction in the comments above):

akrun <-
  cleaned_up %>%
  group_split(Species, Feature) %>%
  map_dfr(~ .x %>%
            mutate(pval = wilcox.test(measurement ~ descriptor,
                                      data = ., paired = FALSE)$p.value))

me <- 
cleaned_up %>%
  group_by(Species, Feature) %>%
  mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)

all.equal(akrun, me)
# [1] TRUE
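
As a quick sanity check on the grouped mutate() above, here is a minimal, self-contained sketch. It assumes a shortcut for rebuilding the long data (pivot_longer() plus separate(), which are not part of the original post) and the name with_pval is only illustrative. It leaves out paired = FALSE because unpaired is already the default. The point is just to show that no rows are lost and that each Species/Feature group carries a single p-value.

```r
library(dplyr)
library(tidyr)

# Assumption: a shorter route to the same long layout as cleaned_up
# (columns Species, Feature, descriptor, measurement).
cleaned_up <- iris %>%
  pivot_longer(-Species, names_to = "attribute", values_to = "measurement") %>%
  separate(attribute, into = c("Feature", "descriptor"), sep = "\\.")

# One test per (Species, Feature) group; the scalar p-value is
# recycled to every row of its group, so nothing is summarized away.
# wilcox.test() may warn about ties in iris; that does not stop the run.
with_pval <- cleaned_up %>%
  group_by(Species, Feature) %>%
  mutate(pval = wilcox.test(measurement ~ descriptor)$p.value) %>%
  ungroup()

# Row count is unchanged, and there is one distinct p-value per group.
stopifnot(nrow(with_pval) == nrow(cleaned_up))
with_pval %>% distinct(Species, Feature, pval)
```

The final distinct() call should list six rows, one per Species/Feature combination.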

Upvotes: 2

akrun

Reputation: 886938

The model object can be wrapped in a list and stored in a list-column:

library(tidyverse)
cleaned_up %>%
  group_by(Species, Feature) %>%
  nest() %>%
  mutate(model = map(data, ~
    .x %>%
      transmute(w = list(wilcox.test(measurement ~ descriptor,
                                     data = ., paired = FALSE)))))
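
If only the p-value is needed rather than the whole test object, the same nest() idea can be followed by unnest() to get back to one row per measurement. The following is a rough sketch under a few assumptions: the long data is rebuilt with pivot_longer() and separate() (a shortcut not used in the question), the names nested and restored are hypothetical, and paired is left at its default (unpaired).

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumption: a shorter route to the same long layout as cleaned_up.
cleaned_up <- iris %>%
  pivot_longer(-Species, names_to = "attribute", values_to = "measurement") %>%
  separate(attribute, into = c("Feature", "descriptor"), sep = "\\.")

# One row per (Species, Feature), with the group's rows in a list-column.
nested <- cleaned_up %>%
  group_by(Species, Feature) %>%
  nest() %>%
  ungroup()

# Run the test on each nested data frame, keep the p-value as a plain
# numeric column, then expand back to the original rows.
restored <- nested %>%
  mutate(pval = map_dbl(data, function(d) {
    wilcox.test(measurement ~ descriptor, data = d)$p.value
  })) %>%
  unnest(data)

stopifnot(nrow(restored) == nrow(cleaned_up))
```

Keeping the nested table around can be handy when other pieces of the test object (for example the statistic) are wanted later; otherwise the grouped approaches are shorter.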

Another option is to group_split() the data into a list of data frames, then map over the list elements and create the 'pval' column in each one after applying the test:

cleaned_up %>%
  group_split(Species, Feature) %>%
  map_dfr(~ .x %>%
            mutate(pval = wilcox.test(measurement ~ descriptor,
                                      data = ., paired = FALSE)$p.value))

Upvotes: 3
