Calling summary function inside lapply returning NaN values

Question

Say I have this data frame:

df <- structure(list(q1 = structure(c(2L, 2L, 4L, 
3L, 1L, 4L), .Label = c("I dont like\na thing", 
"I really dont like\nthat thing", "I like a\nthing", 
"Ambivalent\nabout the thing"), class = "factor"), q2 = structure(c(3L, 
2L, 1L, 1L, 4L, 1L), .Label = c("Neither like\nnor dislike", 
"Somewhat\ndislike", "Somewhat\nlike", "Strongly\ndislike", "Strongly\nlike"
), class = "factor")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

I can run the below dplyr chunk without any problem:

df %>%
    summarise(question = 'q1',
              n = sum(!is.na(q1)), 
              mean = mean(as.numeric(q1), na.rm = T), 
              sd = sd(as.numeric(q1), na.rm = T), 
              se = sd/sqrt(n), 
              ci_lo = mean - qnorm(1 - (.05/2))*se,  # qnorm() provides the specified Z-score
              ci_hi = mean + qnorm(1 - (.05/2))*se,
              min = min(as.integer(q1)),
              max = max(as.integer(q1)))


# A tibble: 1 x 9
  question     n  mean    sd    se ci_lo ci_hi   min   max
  <chr>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 q1           6  2.67  1.21 0.494  1.70  3.64     1     4

But if I wrap this in a function and call it with lapply() on a list of column names, it returns a bunch of NaN and NA values.

summary_stats <- function(question){
  df %>%
    summarise(question = question,
              n = sum(!is.na(question)), 
              mean = mean(as.numeric(question), na.rm = T), 
              sd = sd(as.numeric(question), na.rm = T), 
              se = sd/sqrt(n), 
              ci_lo = mean - qnorm(1 - (.05 / 2)) * se,  # qnorm() provides the specified Z-score
              ci_hi = mean + qnorm(1 - (.05 / 2)) * se,
              min = min(as.numeric(question)),
              max = max(as.numeric(question))) 
}

colnames <- 
  df %>% 
  select(starts_with("q")) %>% 
  colnames

lapply(colnames, summary_stats)


[[1]]
# A tibble: 1 x 9
  question     n  mean    sd    se ci_lo ci_hi   min   max
  <chr>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 q1           1   NaN    NA    NA   NaN   NaN    NA    NA

[[2]]
# A tibble: 1 x 9
  question     n  mean    sd    se ci_lo ci_hi   min   max
  <chr>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 q2           1   NaN    NA    NA   NaN   NaN    NA    NA

Warning messages:
1: In mean(as.integer(question), na.rm = T) : NAs introduced by coercion
2: In is.data.frame(x) : NAs introduced by coercion
3: In mask$eval_all_summarise(quo) : NAs introduced by coercion
4: In mask$eval_all_summarise(quo) : NAs introduced by coercion
5: In mean(as.integer(question), na.rm = T) : NAs introduced by coercion
6: In is.data.frame(x) : NAs introduced by coercion
7: In mask$eval_all_summarise(quo) : NAs introduced by coercion
8: In mask$eval_all_summarise(quo) : NAs introduced by coercion

Does anyone know where I'm going wrong here? I'd also like to return a single tibble with one row per column passed to lapply(), instead of one tbl_df per column. Is that possible?

Ronak Shah · Accepted Answer

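You are passing column names to the function, whereas the function expects column data. Inside summarise(), question is just the string "q1" (or "q2"), not the column of that name, so as.numeric(question) is NA and every statistic is computed on a single missing value.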
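To make that concrete, here is a minimal base-R sketch of what the question's function ends up computing. The variable names `n_seen`, `as_num`, `mean_seen` and `sd_seen` are only illustrative and do not appear in the original code:

```r
# `question` is the function argument: a length-one character vector,
# not the q1 column of df.
question <- "q1"

n_seen    <- sum(!is.na(question))                  # 1: one non-missing string
as_num    <- suppressWarnings(as.numeric(question)) # NA: "q1" is not a number
mean_seen <- mean(as_num, na.rm = TRUE)             # NaN: nothing left after dropping NA
sd_seen   <- sd(as_num, na.rm = TRUE)               # NA: nothing left after dropping NA

print(c(n = n_seen, mean = mean_seen, sd = sd_seen))
```

This matches the n = 1, mean = NaN, sd = NA row in the question's output.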

Here is an alternative way that passes the column data instead:

library(dplyr)
library(purrr)

# `data` is a single column (a vector), not a column name
summary_stats <- function(data){
  tibble(n = sum(!is.na(data)), 
         mean = mean(as.numeric(data), na.rm = TRUE), 
         sd = sd(as.numeric(data), na.rm = TRUE), 
         se = sd/sqrt(n), 
         ci_lo = mean - qnorm(1 - (.05 / 2)) * se,
         ci_hi = mean + qnorm(1 - (.05 / 2)) * se,
         min = min(as.numeric(data)),
         max = max(as.numeric(data))) 
}

map_df(df %>% select(starts_with('q')), summary_stats, .id = 'question')

#  question     n  mean    sd    se ci_lo ci_hi   min   max
#  <chr>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 q1           6  2.67  1.21 0.494 1.70   3.64     1     4
#2 q2           6  2     1.26 0.516 0.988  3.01     1     4
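If you would rather keep passing column names, as in the question, one possible sketch is to look the column up by name with dplyr's `.data` pronoun and combine the per-column results with bind_rows(). The argument name `col_name` and the small stand-in `df` (same factor codes as the question's data, shortened labels) are assumptions made so the example runs on its own:

```r
library(dplyr)

# Stand-in for the question's df: same integer codes, shortened labels (assumption).
df <- tibble(
  q1 = factor(c("b", "b", "d", "c", "a", "d"), levels = c("a", "b", "c", "d")),
  q2 = factor(c("c", "b", "a", "a", "d", "a"), levels = c("a", "b", "c", "d", "e"))
)

# `col_name` is a string; .data[[col_name]] fetches the column with that name.
summary_stats <- function(col_name) {
  df %>%
    summarise(
      question = col_name,
      n     = sum(!is.na(.data[[col_name]])),
      mean  = mean(as.numeric(.data[[col_name]]), na.rm = TRUE),
      sd    = sd(as.numeric(.data[[col_name]]), na.rm = TRUE),
      se    = sd / sqrt(n),
      ci_lo = mean - qnorm(1 - (.05 / 2)) * se,
      ci_hi = mean + qnorm(1 - (.05 / 2)) * se,
      min   = min(as.numeric(.data[[col_name]])),
      max   = max(as.numeric(.data[[col_name]]))
    )
}

col_names <- df %>% select(starts_with("q")) %>% colnames()

# One row per column, stacked into a single tibble.
result <- bind_rows(lapply(col_names, summary_stats))
print(result)
```

This keeps the lapply() structure from the question; whether it or the map_df() version reads better is mostly a matter of taste.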
