Reputation: 1102
Say I have this data frame:
df <- structure(list(q1 = structure(c(2L, 2L, 4L,
3L, 1L, 4L), .Label = c("I dont like\na thing",
"I really dont like\nthat thing", "I like a\nthing",
"Ambivalent\nabout the thing"), class = "factor"), q2 = structure(c(3L,
2L, 1L, 1L, 4L, 1L), .Label = c("Neither like\nnor dislike",
"Somewhat\ndislike", "Somewhat\nlike", "Strongly\ndislike", "Strongly\nlike"
), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I can run the below dplyr chunk without any problem:
df %>%
  summarise(question = 'q1',
            n = sum(!is.na(q1)),
            mean = mean(as.numeric(q1), na.rm = T),
            sd = sd(as.numeric(q1), na.rm = T),
            se = sd/sqrt(n),
            ci_lo = mean - qnorm(1 - (.05/2))*se, # qnorm() provides the specified Z-score
            ci_hi = mean + qnorm(1 - (.05/2))*se,
            min = min(as.integer(q1)),
            max = max(as.integer(q1)))
# A tibble: 1 x 9
question n mean sd se ci_lo ci_hi min max
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 q1 6 2.67 1.21 0.494 1.70 3.64 1 4
But if I wrap this in a function and call it with lapply() on a vector of column names, it returns a bunch of NaN and NA values.
summary_stats <- function(question){
  df %>%
    summarise(question = question,
              n = sum(!is.na(question)),
              mean = mean(as.numeric(question), na.rm = T),
              sd = sd(as.numeric(question), na.rm = T),
              se = sd/sqrt(n),
              ci_lo = mean - qnorm(1 - (.05 / 2)) * se, # qnorm() provides the specified Z-score
              ci_hi = mean + qnorm(1 - (.05 / 2)) * se,
              min = min(as.numeric(question)),
              max = max(as.numeric(question)))
}
colnames <-
  df %>%
  select(starts_with("q")) %>%
  colnames
lapply(colnames, summary_stats)
[[1]]
# A tibble: 1 x 9
question n mean sd se ci_lo ci_hi min max
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 q1 1 NaN NA NA NaN NaN NA NA
[[2]]
# A tibble: 1 x 9
question n mean sd se ci_lo ci_hi min max
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 q2 1 NaN NA NA NaN NaN NA NA
Warning messages:
1: In mean(as.integer(question), na.rm = T) : NAs introduced by coercion
2: In is.data.frame(x) : NAs introduced by coercion
3: In mask$eval_all_summarise(quo) : NAs introduced by coercion
4: In mask$eval_all_summarise(quo) : NAs introduced by coercion
5: In mean(as.integer(question), na.rm = T) : NAs introduced by coercion
6: In is.data.frame(x) : NAs introduced by coercion
7: In mask$eval_all_summarise(quo) : NAs introduced by coercion
8: In mask$eval_all_summarise(quo) : NAs introduced by coercion
Does anyone know where I'm going wrong here? I'd also like to return a single tibble with one row per column passed to lapply(), instead of one tbl_df per column. Is that possible?
Upvotes: 0
Views: 368
Reputation: 389135
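You are passing column names to the function, whereas the function expects column data. Inside summarise(), question is just the string "q1", not the q1 column, so as.numeric(question) returns NA with a coercion warning and every statistic is computed on that single NA.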
Here is an alternative way -
library(dplyr)
library(purrr)
# data is a single column (here a factor); as.numeric() gives its integer level codes
summary_stats <- function(data){
  tibble(n = sum(!is.na(data)),
         mean = mean(as.numeric(data), na.rm = TRUE),
         sd = sd(as.numeric(data), na.rm = TRUE),
         se = sd/sqrt(n),
         ci_lo = mean - qnorm(1 - (.05 / 2)) * se, # 95% CI, normal approximation
         ci_hi = mean + qnorm(1 - (.05 / 2)) * se,
         min = min(as.numeric(data), na.rm = TRUE),
         max = max(as.numeric(data), na.rm = TRUE))
}
map_df(df %>% select(starts_with('q')), summary_stats, .id = 'question')
# question n mean sd se ci_lo ci_hi min max
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 q1 6 2.67 1.21 0.494 1.70 3.64 1 4
#2 q2 6 2 1.26 0.516 0.988 3.01 1 4
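If you would rather keep passing column names, as in the question, one option is dplyr's `.data` pronoun, which looks a column up by a string. The following is only a minimal sketch under some assumptions: the small `df` is stand-in data whose factor level codes match those in the question (the text labels are left out), the argument name `col` and the object names `cols` and `result` are made up for illustration, and `bind_rows()` is used to stack the per-column results into one tibble.

```r
library(dplyr)

# Stand-in data (assumption): factors whose integer codes match the question's q1 and q2
df <- tibble(
  q1 = factor(c(2, 2, 4, 3, 1, 4)),
  q2 = factor(c(3, 2, 1, 1, 4, 1))
)

# `col` is a column name given as a string; .data[[col]] fetches that column
summary_stats <- function(col) {
  df %>%
    summarise(
      question = col,
      n     = sum(!is.na(.data[[col]])),
      mean  = mean(as.numeric(.data[[col]]), na.rm = TRUE),
      sd    = sd(as.numeric(.data[[col]]), na.rm = TRUE),
      se    = sd / sqrt(n),
      ci_lo = mean - qnorm(1 - 0.05 / 2) * se,  # 95% CI, normal approximation
      ci_hi = mean + qnorm(1 - 0.05 / 2) * se,
      min   = min(as.numeric(.data[[col]]), na.rm = TRUE),
      max   = max(as.numeric(.data[[col]]), na.rm = TRUE)
    )
}

cols <- df %>% select(starts_with("q")) %>% colnames()

# lapply() returns a list of one-row tibbles; bind_rows() stacks them into one
result <- bind_rows(lapply(cols, summary_stats))
result
```

The argument is named `col` rather than `question` so that it cannot be confused with the `question` column created inside `summarise()`.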
Upvotes: 1