Reputation: 93
Today I began working with purrr functions so that I can use R in a more functional style. I currently have a data frame that contains a response variable along with many other variables. My goal is to split the data frame by the levels of the response column, and then run shapiro.test() on each of the resulting data frames.
For example, this code works:
# fake data
df = data.frame(y = c(rep(1,10), rep(2, 10)),
a = rnorm(20),
b = runif(20),
c = rnorm(20))
df$y <- factor(df$y)
df %>%
select(y, a) %>%
split(.$y) %>%
map(~shapiro.test(.x$a))
And this returns:
$`1`
Shapiro-Wilk normality test
data: .x$a
W = 0.93455, p-value = 0.4941
$`2`
Shapiro-Wilk normality test
data: .x$a
W = 0.7861, p-value = 0.009822
So this works as I want it to on an individual column, but I would like to run it on an arbitrary vector of columns. My thinking right now is to create a vector of the column names I want to test and use that in a map(). I think I'm pretty close to having this right, but I'm just a little stuck.
# Function that splits the df into two groups based on y levels and run shapiro test on the split dfs
shapiro <- function(var) {
df_list = df %>%
select(y, var) %>%
split(.$y) %>%
map(~shapiro.test(.x$var))
return(df_list)
}
This fails:
> shapiro(a)
Error in .f(.x[[i]], ...) : object 'a' not found
Which makes sense, since a is not an object in the environment. This is roughly the direction I have in mind, but I don't know if there's a better way to go about it.
# the column names I want the function to take
columns = c(a, b, c)
# map it
map(columns, shapiro)
However, this gives an error since the column names aren't in the environment. Does anyone have suggestions on how to fix this or improve it?
Thanks!
Upvotes: 1
Views: 173
Reputation: 16842
If you want to do this with a function, you'll likely need to get into tidyeval, as in @MauritsEvers's answer. For a relatively small task like this, you could instead get away with a couple of map calls. Map over the list of data frames created by splitting by y, then use map_at to apply the test to the columns of your choice.
In the first method, you end up with some excess: any columns not named in the map_at call are just left hanging in the result. The cleaner way is to select the columns you want, and then map over all of them to apply the test.
library(tidyverse)
test_list1 <- df %>%
split(.$y) %>%
map(function(split_by_y) {
split_by_y %>%
map_at(vars(a, b, c), shapiro.test)
})
test_list2 <- df %>%
split(.$y) %>%
map(function(split_by_y) {
split_by_y %>%
select(a, b, c) %>%
map(shapiro.test)
})
test_list2[[2]]$a
#>
#> Shapiro-Wilk normality test
#>
#> data: .x[[i]]
#> W = 0.95281, p-value = 0.7018
Created on 2019-03-05 by the reprex package (v0.2.1)
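To drive the second approach from a vector of column names, as the question asks, one option is to pass a character vector to select() through all_of(). This is a minimal sketch, not part of the original answer: the names cols and test_list3 are hypothetical, the fake data is regenerated with an arbitrary seed only so the block runs on its own, and it assumes a dplyr version that exports all_of().

```r
library(dplyr)
library(purrr)

# fake data in the same shape as the question (seed is arbitrary)
set.seed(1)
df <- data.frame(
  y = factor(rep(1:2, each = 10)),
  a = rnorm(20),
  b = runif(20),
  c = rnorm(20)
)

# hypothetical name: the columns to test, given as strings
cols <- c("a", "b", "c")

test_list3 <- df %>%
  split(.$y) %>%                  # one data frame per level of y
  map(function(split_by_y) {
    split_by_y %>%
      select(all_of(cols)) %>%    # keep only the requested columns
      map(shapiro.test)           # one test per remaining column
  })

# result for column a within level "2" of y
test_list3[["2"]]$a
```

The result is nested by level of y first and by column second, the same layout as test_list2 above.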
Upvotes: 1
Reputation: 50678
Here is a tidyverse way with three corrections/improvements:
1. In shapiro(a), you provide the column as a symbol, so we need to make sure that a is properly quoted and then later un-quoted to adhere to dplyr's non-standard evaluation.
2. Instead of split, a more tidyverse-consistent approach is to use nest.
3. Make df a function argument of shapiro, thereby avoiding the dependence on a global variable.
This is the improved version:
shapiro <- function(df, var) {
var <- enquo(var)
df_list <- df %>%
select(y, !!var) %>%
group_by(y) %>%
nest() %>%
mutate(test = map(setNames(data, y), ~shapiro.test(.x[[1]]))) %>%
pull(test)
return(df_list)
}
So for column df$a
shapiro(df, a)
#$`1`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.93049, p-value = 0.4527
#
#
#$`2`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.9268, p-value = 0.4171
and for column df$b
shapiro(df, b)
#$`1`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.90313, p-value = 0.237
#
#
#$`2`
#
# Shapiro-Wilk normality test
#
#data: .x[[1]]
#W = 0.88552, p-value = 0.1509
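To run a tidyeval-style function over several columns at once, the column can be forwarded with the embrace operator and the function mapped over a named character vector. The following is a sketch under stated assumptions rather than the function above: shapiro_by_group is a hypothetical variant that uses split instead of nest, it assumes an rlang version that supports {{ }} and a dplyr version that exports all_of(), and the data is regenerated with an arbitrary seed so the block is self-contained.

```r
library(dplyr)
library(purrr)

# fake data in the same shape as the question (seed is arbitrary)
set.seed(1)
df <- data.frame(
  y = factor(rep(1:2, each = 10)),
  a = rnorm(20),
  b = runif(20),
  c = rnorm(20)
)

# hypothetical variant: df is an argument, var is forwarded with {{ }}
shapiro_by_group <- function(df, var) {
  df %>%
    select(y, value = {{ var }}) %>%   # rename the chosen column to `value`
    split(.$y) %>%                     # one data frame per level of y
    map(~ shapiro.test(.x$value))
}

# a bare column name works, as in shapiro(df, a)
by_symbol <- shapiro_by_group(df, a)

# for a character vector, wrap each name in all_of() and name the results
cols <- c("a", "b", "c")
by_column <- map(set_names(cols), ~ shapiro_by_group(df, all_of(.x)))

by_column$b[["1"]]
```

Here the outer list is keyed by column and the inner list by level of y, so by_column$b[["1"]] is the test for column b in the first group.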
Upvotes: 2
Reputation: 120
You can append the results to a list using a for loop:
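shapiro <- function(var) {
  # var: character vector of column names; df is taken from the global environment
  myList <- vector("list", length(var))
  for (i in seq_along(var)) {
    # select the i-th column by name and rename it to `value` so $ can find it
    myList[[i]] <- df %>%
      select(y, value = all_of(var[i])) %>%
      split(.$y) %>%
      map(~shapiro.test(.x$value))
  }
  return(myList)
}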
Just make sure to use a character vector for the columns:
shapiro(c("a", "b"))
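The same idea can be written without a loop or a renaming step by indexing each column with [[ and a string, which also avoids the .x$var lookup that failed in the question ($ looks for a column literally named var). A minimal sketch, assuming the column names arrive as a character vector; shapiro_cols is a hypothetical name, and the data is regenerated with an arbitrary seed so the block runs on its own.

```r
library(purrr)

# fake data in the same shape as the question (seed is arbitrary)
set.seed(1)
df <- data.frame(
  y = factor(rep(1:2, each = 10)),
  a = rnorm(20),
  b = runif(20),
  c = rnorm(20)
)

# hypothetical helper: one list element per column, each holding one test per level of y
shapiro_cols <- function(df, cols) {
  groups <- split(df, df$y)                     # split once, reuse for every column
  map(set_names(cols), function(col) {
    map(groups, ~ shapiro.test(.x[[col]]))      # [[ accepts the column name as a string
  })
}

res <- shapiro_cols(df, c("a", "b"))
res$b[["2"]]
```

Because the input vector is passed through set_names(), the results are labeled by column name rather than by position.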
Upvotes: 0