Michael

Reputation: 111

Loop for Selecting and Summarising Each Column for Later Permutation

I have a dataset similar to the one below. I need to run a permutation test for mean differences in a loop. My main problem is that I have to loop through the columns of the dataset and I don't know how.

df = data.frame(matrix(rnorm(10), nrow=5)) 
category <- rep(c("good", "bad"), c(2, 3))
id <- c(1, 2, 3, 4, 5)
df <- cbind(id, df, category)

  id         X1         X2 category 
1  1  0.5584823 -2.3135133     good     
2  2 -0.1115585  0.4731869     good     
3  3 -0.7435472 -0.0231894      bad      
4  4 -0.6673812  0.7470000      bad      
5  5 -1.2959973  0.4255970      bad      

So in the loop I basically need to do this:

df %>% filter(category == "bad") %>% select(X1) %>% summarise(mean_X_bad = mean(X1))
df %>% filter(category == "good") %>% select(X2) %>% summarise(mean_X_good = mean(X2))

For both X1 and X2 (and 98 other X variables not shown here).

So for each X from 1 to 100 I need the mean of X in the "good" group and the mean of X in the "bad" group. I can then run a permutation loop on the difference in means between the groups for every X.

I don't know how to write a loop that selects a column, maps it to the category, and returns the mean of that subset. I assume that for the permutation I need a vector of the "good" means and a vector of the "bad" means to compare, so I guess those have to be the result of the first loop?

Upvotes: 1

Views: 72

Answers (2)

akrun

Reputation: 887078

To loop, use map2. Going by the OP's code, 'bad' is paired with column 'X1' and 'good' with 'X2', so pass these as two vectors to map2. For each pair, filter the rows, select the column, and summarise its mean under a new name

library(tidyverse)
# .x is the category to filter on, .y is the name of the column to average
map2(c("bad", "good"), c("X1", "X2"), ~ 
     df %>% 
       filter(category == .x) %>% 
       select(all_of(.y)) %>%
       summarise(!! paste0("mean_X_", .x) := mean(!! rlang::sym(.y))))
#[[1]]
#  mean_X_bad
#1 -0.4954794

#[[2]]
#  mean_X_good
#1   0.7497338

Instead of filtering by 'category', group by it and then use summarise_at

df %>%
   group_by(category) %>%
   summarise_at(vars(matches("^X\\d+$")), mean)
# A tibble: 2 x 3
#  category       X1     X2
#  <fct>       <dbl>  <dbl>
#1 bad       0.228   -0.438
#2 good     -0.00465  0.355
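
In dplyr 1.0 and later, summarise_at is superseded by across. A minimal sketch of the same grouped summary, assuming the toy data from the "data" block of this answer; the names group_means and mean_diff are only illustrative. It also takes the per-column difference in means (good minus bad), which is the statistic the OP wants to permute.

```r
library(dplyr)

# Toy data, built the same way as in the "data" block of this answer
set.seed(24)
df <- data.frame(matrix(rnorm(10), nrow = 5))
category <- rep(c("good", "bad"), c(2, 3))
id <- c(1, 2, 3, 4, 5)
df <- cbind(id, df, category)

# One row per category, one mean per X column
group_means <- df %>%
  group_by(category) %>%
  summarise(across(matches("^X\\d+$"), mean))

# Difference in means (good minus bad) for every X column, as a named vector
good <- group_means %>% filter(category == "good") %>% select(-category)
bad  <- group_means %>% filter(category == "bad")  %>% select(-category)
mean_diff <- unlist(good) - unlist(bad)
```

The result is a named numeric vector with one entry per X column, so it scales to 100 columns without any change.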

This gives the same result as the gather approach below without any reshaping (gathering only transposes the result into long format)

df %>%
   gather(key = "variable", value = "value", -id, -category) %>%
   group_by(category, variable) %>%
   summarise(mean = mean(value))
# A tibble: 4 x 3
# Groups:   category [2]
#  category variable     mean
#  <fct>    <chr>       <dbl>
#1 bad      X1        0.228  
#2 bad      X2       -0.438  
#3 good     X1       -0.00465
#4 good     X2        0.355  

data

set.seed(24)
df = data.frame(matrix(rnorm(10), nrow=5)) 
category <- rep(c("good", "bad"), c(2, 3))
id <- c(1, 2, 3, 4, 5)
df <- cbind(id, df, category)
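
For the permutation step itself, one possible sketch in base R is shown here. It is not taken from the OP's code: the helper mean_diff, the number of shuffles n_perm, and the two-sided p-value formula are all assumptions chosen for illustration. For each X column it computes the observed difference in means and compares it with the differences obtained after shuffling the category labels.

```r
# Toy data, same construction as above
set.seed(24)
df <- data.frame(matrix(rnorm(10), nrow = 5))
category <- rep(c("good", "bad"), c(2, 3))
id <- c(1, 2, 3, 4, 5)
df <- cbind(id, df, category)

x_cols <- grep("^X[0-9]+$", names(df), value = TRUE)
n_perm <- 999  # number of label shuffles; an arbitrary choice

# Difference in means (good minus bad) for one column (hypothetical helper)
mean_diff <- function(x, g) mean(x[g == "good"]) - mean(x[g == "bad"])

# Loop over the X columns and return one p-value per column
p_values <- sapply(x_cols, function(col) {
  observed <- mean_diff(df[[col]], df$category)
  # Shuffle the labels and recompute the statistic n_perm times
  permuted <- replicate(n_perm, mean_diff(df[[col]], sample(df$category)))
  # Two-sided p-value, counting the observed statistic as one permutation
  (sum(abs(permuted) >= abs(observed)) + 1) / (n_perm + 1)
})
```

With only five rows there are just ten distinct ways to split the labels, so the toy data cannot produce small p-values; the loop is meant to show the structure only.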

Upvotes: 0

Marian Minar

Reputation: 1456

Gather your data first (make it "long" instead of "wide") by using tidyr::gather, then summarise by grouping the categories and variables:

library(tidyverse)

df %>%
  gather(key = "variable", value = "value", -id, -category) %>%
  group_by(category, variable) %>%
  summarise(mean = mean(value))

Here's the output:

# A tibble: 4 x 3
# Groups:   category [2]
  category variable    mean
  <fct>    <chr>      <dbl>
1 bad      X1       -0.323 
2 bad      X2        0.342 
3 good     X1        0.0793
4 good     X2        0.632 
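
In tidyr 1.0 and later, gather is superseded by pivot_longer. A minimal sketch of the same reshaping and summary, assuming the data is built as in the question; the seed here is arbitrary, so the numbers will not match the output above, and the name long_means is only illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data built as in the question (arbitrary seed)
set.seed(1)
df <- data.frame(matrix(rnorm(10), nrow = 5))
category <- rep(c("good", "bad"), c(2, 3))
id <- c(1, 2, 3, 4, 5)
df <- cbind(id, df, category)

# Reshape to long format, then take the mean per category and variable
long_means <- df %>%
  pivot_longer(cols = -c(id, category),
               names_to = "variable", values_to = "value") %>%
  group_by(category, variable) %>%
  summarise(mean = mean(value), .groups = "drop")
```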

Upvotes: 1
