89_Simple
89_Simple

Reputation: 3805

R: Apply function to calculate mean of a single column of dataframe across a list

Some sample data

I have three lists

    loc <- c("A","A","A","B","B","B")
    sub.loc <- c(1,2,3,1,2,3)

    set.seed(123)

    df1 <- as.data.frame(cbind(loc,sub.loc, round(rnorm(6),digits =2)))
    df2 <- as.data.frame(cbind(loc,sub.loc, round(rnorm(6),digits =2)))
    df3 <- as.data.frame(cbind(loc,sub.loc, round(rnorm(6),digits =2)))

    list.name <- list(df1,df2,df3)

I want to produce a single file that has mean and sd of the third column V3.

Something like: 

    loc    sub.loc        V3                                      v4
    A        1        mean(c(-0.56,0.46,0.4))      sd(c(-0.56,0.46,0.4)) 
    A        2        mean(c(-0.23,-1.27,0.11))    sd(c(-0.23,-1.27,0.11))
    A        3        mean(c(-0.56,-0.69, 1.56))   sd(c(-0.56,-0.69, 1.56))
    B        1       mean(c(0.07,-0.45,1.79))      sd(c(0.07,-0.45,1.79))
    B        2        mean(c(0.13,1.22,0.5))       sd(c(0.13,1.22,0.5))
    B        3        mean(c(1.72,0.36,-1.97))     sd(c(1.72,0.36,-1.97))

My actual data in column `V3`` has NAs

I thought of using lapply

    lapply(list.name, function(x) mean(x, na.rm = T))

    lapply(list.name, function(x) sd(x, na.rm = T))

But both of them give me NAs

Upvotes: 1

Views: 1396

Answers (2)

Parfait
Parfait

Reputation: 107567

Currently, you are running mean() and sd() across entire dataframe which contains non-numeric columns and hence NA.

Consider base R's by (object-oriented wrapper of tapply) where you first stack list of dataframes with rbind and then run aggregations across loc and sub.loc groupings:

stackdf <- do.call(rbind, list.name)
stackdf
#    loc sub.loc    V3
# 1    A       1 -0.56
# 2    A       2 -0.23
# 3    A       3  1.56
# 4    B       1  0.07
# 5    B       2  0.13
# 6    B       3  1.72
# 7    A       1  0.46
# 8    A       2 -1.27
# 9    A       3 -0.69
# 10   B       1 -0.45
# 11   B       2  1.22
# 12   B       3  0.36
# 13   A       1  0.40
# 14   A       2  0.11
# 15   A       3 -0.56
# 16   B       1  1.79
# 17   B       2  0.50
# 18   B       3 -1.97

dfs <- by(stackdf, stackdf[c("loc", "sub.loc")], FUN=function(df) {      
  data.frame(loc = df$loc[1],
             sub.loc = df$sub.loc[1],
             mean = mean(df$V3, na.rm=TRUE),
             sd = sd(df$V3, na.rm=TRUE))
})

finaldf <- do.call(rbind, dfs)
finaldf
#   loc sub.loc        mean        sd
# 1   A       1  0.10000000 0.5723635
# 2   B       1  0.47000000 1.1723481
# 3   A       2 -0.46333333 0.7189808
# 4   B       2  0.61666667 0.5542863
# 5   A       3  0.10333333 1.2631838
# 6   B       3  0.03666667 1.8661279

Upvotes: 0

MrFlick
MrFlick

Reputation: 206167

This can be done with dplyr. First, I'm not sure how accurate your sample data above matches your real data but right now all your "numeric" values are factors. You really shouldn't use cbind() inside as.data.frame(), you can leave it out.

But with your example data above, we can stack the data into one larger data.frame and then do a simple group_by to get the values you want

library(dplyr)
bind_rows(list.name, .id="from") %>% 
  mutate(V3=as.numeric(as.character(V3))) %>%  # fix the factors from the sample
  group_by(loc, sub.loc) %>% 
  summarize(mean=mean(V3, na.rm=T), sd=sd(V3, na.rm=T))

Upvotes: 1

Related Questions