Reputation: 93

Using dataframe column names inside select statement inside function for use with map()

Today I began working with purrr functions so that I can use R in a more functional style. I have a dataframe that contains a response variable along with many other variables. My goal is to split the dataframe by the levels of the response column, and then run shapiro.test() on each of the split dataframes.

For example, this code works:

library(dplyr)
library(purrr)

# fake data
df = data.frame(y = c(rep(1,10), rep(2, 10)), 
                a = rnorm(20),
                b = runif(20), 
                c = rnorm(20))

df$y <- factor(df$y)    

df %>% 
    select(y, a) %>% 
    split(.$y) %>% 
    map(~shapiro.test(.x$a))

And this returns:

$`1`

    Shapiro-Wilk normality test

data:  .x$a
W = 0.93455, p-value = 0.4941


$`2`

    Shapiro-Wilk normality test

data:  .x$a
W = 0.7861, p-value = 0.009822

So this works as I want it to on an individual column, but I would like it to run on any given vector of columns. My thinking right now is to create a vector of the column names I want to test and pass that to map(). I think I'm pretty close to having this right, but I'm just a little stuck.

# Function that splits the df into two groups based on y levels and run shapiro test on the split dfs
shapiro <- function(var) {
  df_list = df %>% 
    select(y, var) %>% 
    split(.$y) %>% 
    map(~shapiro.test(.x$var))
  return(df_list)
}

This fails:

> shapiro(a)
Error in .f(.x[[i]], ...) : object 'a' not found

Which makes sense, since a is not an object in the environment. This is roughly the direction I have in mind, but I don't know if there's a better way to go about it.

# the column names I want the function to take
columns = c(a, b, c)

# map it
map(columns, shapiro)

However, this gives an error since the column names aren't in the environment. Does anyone have suggestions on how to fix this or improve it?

Thanks!

Upvotes: 1

Answers (3)

camille

Reputation: 16842

If you want to do this with a function, you'll likely need to get into tidyeval, as in @MauritsEvers' answer. For a relatively small task like this, you could instead get away with a couple of map calls. Map over the list of data frames created by splitting by y, then use map_at to apply the test to the columns of your choice.

With the first method, you end up with some excess: any columns not named in map_at are returned untouched alongside the test results. The cleaner way is to select the columns you want first, and then map over all remaining columns to apply the test.

library(tidyverse)

test_list1 <- df %>%
  split(.$y) %>%
  map(function(split_by_y) {
    split_by_y %>%
      map_at(c("a", "b", "c"), shapiro.test)
  })

test_list2 <- df %>%
  split(.$y) %>%
  map(function(split_by_y) {
    split_by_y %>%
      select(a, b, c) %>%
      map(shapiro.test)
  })

test_list2[[2]]$a
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  .x[[i]]
#> W = 0.95281, p-value = 0.7018

Created on 2019-03-05 by the reprex package (v0.2.1)
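
If you only need the p-values rather than the full test objects, the nested list can be flattened into a small matrix. The following is a minimal sketch, not part of the reprex above: it assumes data shaped like the question's df, and the names test_list, p_values and p_table are made up for illustration.

```r
library(purrr)

# fake data in the same shape as the question's df (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(y = factor(rep(1:2, each = 10)),
                 a = rnorm(20),
                 b = runif(20),
                 c = rnorm(20))

# outer list: one element per level of y; inner list: one test per column
test_list <- split(df, df$y) %>%
  map(~ map(.x[c("a", "b", "c")], shapiro.test))

# pull the p.value element out of every test object
p_values <- map(test_list, ~ map_dbl(.x, "p.value"))

# rows are the levels of y, columns are the tested variables
p_table <- do.call(rbind, p_values)
p_table
```

Passing a string to map_dbl extracts the element of that name from each list item, so no anonymous function is needed for the inner step.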

Upvotes: 1

Maurits Evers

Reputation: 50678

Here is a tidyverse way with three corrections/improvements:

1. In your example call shapiro(a), you provide the column as a symbol, so we need to make sure that a is properly quoted and then later unquoted to adhere to dplyr's non-standard evaluation.
2. Instead of split, a more tidyverse-consistent approach is to use nest.
3. Lastly, I would recommend making df a function argument of shapiro, thereby avoiding the dependence on a global variable.

This is the improved version:

shapiro <- function(df, var) {
  var <- enquo(var)
  df_list <- df %>%
      select(y, !!var) %>%
      group_by(y) %>%
      nest() %>%
      mutate(test = map(setNames(data, y), ~shapiro.test(.x[[1]]))) %>%
      pull(test)
  return(df_list)
}

So for column df$a:

shapiro(df, a)
#$`1`
#
#   Shapiro-Wilk normality test
#
#data:  .x[[1]]
#W = 0.93049, p-value = 0.4527
#
#
#$`2`
#
#   Shapiro-Wilk normality test
#
#data:  .x[[1]]
#W = 0.9268, p-value = 0.4171

and for column df$b:

shapiro(df, b)
#$`1`
#
#   Shapiro-Wilk normality test
#
#data:  .x[[1]]
#W = 0.90313, p-value = 0.237
#
#
#$`2`
#
#   Shapiro-Wilk normality test
#
#data:  .x[[1]]
#W = 0.88552, p-value = 0.1509
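
To run the test over several columns in one go, as the question's map(columns, shapiro) attempts, one option is to pass the column names as strings. The sketch below swaps tidyeval for plain string indexing, so it is an alternative to the function above rather than an extension of it. The helper name shapiro_by_group is hypothetical, and the data are assumed to have the same shape as the question's df.

```r
library(dplyr)
library(purrr)

# fake data in the same shape as the question's df (seed chosen arbitrarily)
set.seed(1)
df <- data.frame(y = factor(rep(1:2, each = 10)),
                 a = rnorm(20),
                 b = runif(20),
                 c = rnorm(20))

# hypothetical helper: `var` is a column name given as a string
shapiro_by_group <- function(df, var) {
  split(df[[var]], df$y) %>%   # list of numeric vectors, one per level of y
    map(shapiro.test)
}

# column names as a character vector, not bare symbols
columns <- c("a", "b", "c")

# outer list is named by column, inner lists by level of y
results <- columns %>%
  set_names() %>%
  map(~ shapiro_by_group(df, .x))

results$b$`2`
```

Because the names are ordinary strings, nothing has to be quoted or unquoted, at the cost of losing the bare-symbol interface of shapiro(df, a).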

Upvotes: 2

J.Moon

Reputation: 120

You can append the results to a list using a for loop:

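shapiro <- function(var) {
  myList = list()
  # seq_along() is safe even when var has length zero
  for (i in seq_along(var)) {
    myList[[i]] = df %>%
      # all_of() selects by string; the column is renamed to `var`
      select(y, var = all_of(var[i])) %>%
      split(.$y) %>%
      map(~shapiro.test(.x$var))
  }
  return(myList)
}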

Just make sure to use a character vector for the columns:

shapiro(c("a", "b"))

Upvotes: 0
