Aditya K

Reputation: 11

How to convert an output of "by" function to a data frame in R?

Here is what I want to do. I have a data frame df defined as:

col1 <- c("a","a","a","a","a","a","b","b","b","b","b","b")
col2 <- c("z","z","x","x","z","x", "z","z","x","x","z","x")
col3 <- c(1,2,3,4,5,6,7,8,9,10,11,12)
df <- data.frame(col1,col2,col3)

and a function pred that calculates the mean, defined as:

pred <- function(subset_df){return(mean(subset_df$col3))}

Using by, I want to get a data frame in the following format:

col1 col2 col3_mean
a     x    4.33
a     z    2.66
b     x    10.33
b     z    8.66

I am currently using by() to partition the data into its strata and apply pred(), which calculates the mean, to each one:

by_keys <- c("col1","col2")
data_sub <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub)

This fails with "Error in do.call(rbind, data_sub) : second argument must be a list".

I tried a solution from a similar thread, but it does not give me col1 and col2 as in the desired format:

as.data.frame(vapply(data_sub,unlist,unlist(data_sub[[1]])))

Would appreciate any help on this.

Upvotes: 1

Views: 1380

Answers (2)

Parfait

Reputation: 107697

Indeed, by as you have set it up does not return a list. Because pred returns a single number per group, the results are simplified into an array (of class "by"), which do.call cannot accept as its second argument. Adjust pred to return a data frame instead: data frames cannot be simplified, so by then returns a list that can be passed to do.call.

pred <- function(subset_df){    
  df <- data.frame(col1 = subset_df$col1[[1]], 
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3)
                  )                      
  return(df)
}

data_sub_list <- by(df, df[,by_keys], pred)  
data_sub <- do.call(rbind, data_sub_list)
data_sub

#   col1 col2 col3_mean
# 1    a    x  4.333333
# 2    b    x 10.333333
# 3    a    z  2.666667
# 4    b    z  8.666667
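To see why the original call failed, it can help to inspect what by returns when the function yields a single number. The sketch below rebuilds the question's data, keeps the scalar version of pred under the hypothetical name pred_scalar, and then converts the resulting array with expand.grid. This is only one possible alternative, assuming the result really is simplified to a numeric array as it is for this data.

```r
# Rebuild the question's data and its original scalar-returning function.
df <- data.frame(
  col1 = rep(c("a", "b"), each = 6),
  col2 = rep(c("z", "z", "x", "x", "z", "x"), times = 2),
  col3 = 1:12
)
pred_scalar <- function(subset_df) mean(subset_df$col3)  # hypothetical name

res <- by(df, df[, c("col1", "col2")], pred_scalar)
class(res)    # "by"
is.list(res)  # FALSE, which is why do.call(rbind, res) stops
dim(res)      # 2 2

# One row per cell of the array. expand.grid() varies col1 fastest,
# which matches the column-major order of the stored values.
out <- expand.grid(dimnames(res), stringsAsFactors = FALSE)
out$col3_mean <- as.vector(unclass(res))
out
```

This keeps pred unchanged, but it only works while every group returns exactly one value.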

However, as commented by @Onyambu, this type of grouped aggregation can be done with aggregate which will return data frames.

# FORMULA VERSION
aggregate(col3 ~ col1 + col2, df, mean)
#   col1 col2      col3
# 1    a    x  4.333333
# 2    b    x 10.333333
# 3    a    z  2.666667
# 4    b    z  8.666667

# NON-FORMULA VERSION
aggregate(df$col3, by=list(col1=df$col1, col2=df$col2), mean)
#   col1 col2         x
# 1    a    x  4.333333
# 2    b    x 10.333333
# 3    a    z  2.666667
# 4    b    z  8.666667
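aggregate names the result column after its input (col3 in the formula call, x in the non-formula call) rather than col3_mean. A minimal sketch of renaming it to match the desired format, using the question's data and the hypothetical variable name agg:

```r
df <- data.frame(
  col1 = rep(c("a", "b"), each = 6),
  col2 = rep(c("z", "z", "x", "x", "z", "x"), times = 2),
  col3 = 1:12
)

agg <- aggregate(col3 ~ col1 + col2, data = df, FUN = mean)

# Rename only the aggregated column, leaving the grouping columns as they are.
names(agg)[names(agg) == "col3"] <- "col3_mean"
agg
```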

Usually, by (an object-oriented wrapper for tapply applied to data frames) is best suited to larger, more extensive operations in which each subset of the data frame is passed through a function in turn. For example, by becomes useful when you need several aggregates per group:

pred <- function(subset_df){      
  df <- data.frame(col1 = subset_df$col1[[1]], 
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3),
                   col3_sd = sd(subset_df$col3),
                   col3_median = median(subset_df$col3),
                   col3_min = min(subset_df$col3),
                   col3_max = max(subset_df$col3),
                   col3_sum = sum(subset_df$col3),
                   col3_25pct = quantile(subset_df$col3)[[2]],
                   col3_75pct = quantile(subset_df$col3)[[4]],
                   col3_IQR = IQR(subset_df$col3)
                  )      
  return(df)
}

data_sub_list <- by(df, df[,by_keys], pred)  
data_sub <- do.call(rbind, data_sub_list)

#   col1 col2 col3_mean  col3_sd col3_median col3_min col3_max col3_sum col3_25pct col3_75pct col3_IQR
# 1    a    x  4.333333 1.527525           4        3        6       13        3.5        5.0      1.5
# 2    b    x 10.333333 1.527525          10        9       12       31        9.5       11.0      1.5
# 3    a    z  2.666667 2.081666           2        1        5        8        1.5        3.5      2.0
# 4    b    z  8.666667 2.081666           8        7       11       26        7.5        9.5      2.0
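The per-group iteration that by performs can also be spelled out with split and lapply, which always yields a plain list. This is a sketch of an equivalent pattern rather than a replacement for the code above; summarise_group and pieces are hypothetical names, and only the mean is computed to keep it short.

```r
df <- data.frame(
  col1 = rep(c("a", "b"), each = 6),
  col2 = rep(c("z", "z", "x", "x", "z", "x"), times = 2),
  col3 = 1:12
)

# Hypothetical helper: one-row data frame describing a single group.
summarise_group <- function(subset_df) {
  data.frame(col1 = subset_df$col1[1],
             col2 = subset_df$col2[1],
             col3_mean = mean(subset_df$col3))
}

# split() returns a named list of sub-data-frames, one per col1/col2 combination;
# drop = TRUE discards combinations that do not occur in the data.
pieces <- split(df, df[, c("col1", "col2")], drop = TRUE)
data_sub <- do.call(rbind, lapply(pieces, summarise_group))
rownames(data_sub) <- NULL
data_sub
```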

Upvotes: 3

thc

Reputation: 9705

Use dplyr:

library(dplyr)

df %>% group_by(col1, col2) %>% 
  summarize(col3_mean = mean(col3)) %>%
  as.data.frame


  col1 col2 col3_mean
1    a    x     4.333
2    a    z     2.667
3    b    x    10.333
4    b    z     8.667

Upvotes: 0
