Reputation: 11
Here is what I want to do. I have a data frame df defined as:
col1 <- c("a","a","a","a","a","a","b","b","b","b","b","b")
col2 <- c("z","z","x","x","z","x", "z","z","x","x","z","x")
col3 <- c(1,2,3,4,5,6,7,8,9,10,11,12)
df <- data.frame(col1,col2,col3)
and a function pred that calculates the mean of col3, defined as:
pred <- function(subset_df){return(mean(subset_df$col3))}
I want to produce a data frame, using by(), in the following format:
col1 col2 col3_mean
a x 4.33
a z 2.67
b x 10.33
b z 8.67
I am currently using by() to partition the data into its strata and apply pred(), which calculates the mean, to each one:
by_keys <- c("col1","col2")
data_sub <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub)
The last line fails with: "Error in do.call(rbind, data_sub) : second argument must be a list"
I tried a solution from a similar thread, but it does not give me col1 and col2 as in the desired format:
as.data.frame(vapply(data_sub,unlist,unlist(data_sub[[1]])))
Would appreciate any help on this.
Upvotes: 1
Views: 1380
Reputation: 107697
Indeed, by() as you set it up will not return a list but a simplified structure, because pred returns a single number for each group. Adjust pred to return a data frame instead: these results cannot be simplified, so by() is forced to return a list, which can then be passed into do.call().
pred <- function(subset_df){
  df <- data.frame(col1 = subset_df$col1[[1]],
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3)
  )
  return(df)
}
data_sub_list <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub_list)
data_sub
# col1 col2 col3_mean
# 1 a x 4.333333
# 2 b x 10.333333
# 3 a z 2.666667
# 4 b z 8.666667
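To see why the original call fails, it can help to inspect what by() returns in each case. The following is a minimal sketch in base R that rebuilds the question's data; the names scalar_res, frame_res and flat are made up for illustration. It also shows one possible way to flatten the simplified result without changing pred, by going through as.table().

```r
# Rebuild the example data from the question
df <- data.frame(
  col1 = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  col2 = c("z","z","x","x","z","x","z","z","x","x","z","x"),
  col3 = c(1,2,3,4,5,6,7,8,9,10,11,12)
)
by_keys <- c("col1", "col2")

# pred returns one number per group, so by() simplifies to a numeric array
scalar_res <- by(df, df[, by_keys], function(d) mean(d$col3))
is.list(scalar_res)   # FALSE, which is why do.call(rbind, ...) complains

# pred returns a three-column data frame per group, so nothing is simplified
frame_res <- by(df, df[, by_keys], function(d) {
  data.frame(col1 = d$col1[[1]], col2 = d$col2[[1]], col3_mean = mean(d$col3))
})
is.list(frame_res)    # TRUE, so do.call(rbind, frame_res) works

# Alternative (illustrative): flatten the numeric array, keeping the keys
flat <- as.data.frame(as.table(unclass(scalar_res)),
                      responseName = "col3_mean",
                      stringsAsFactors = FALSE)
flat
```

The as.table() route keeps the dimnames of the array as the col1 and col2 columns, so it only fits the case where each group yields a single value.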
However, as commented by @Onyambu, this type of grouped aggregation can be done with aggregate(), which returns a data frame directly.
# FORMULA VERSION
aggregate(col3 ~ col1 + col2, df, mean)
# col1 col2 col3
# 1 a x 4.333333
# 2 b x 10.333333
# 3 a z 2.666667
# 4 b z 8.666667
# NON-FORMULA VERSION
aggregate(df$col3, by=list(col1=df$col1, col2=df$col2), mean)
# col1 col2 x
# 1 a x 4.333333
# 2 b x 10.333333
# 3 a z 2.666667
# 4 b z 8.666667
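Neither aggregate() call produces the exact layout asked for in the question (a col3_mean column, rows ordered by col1 then col2, two decimals). A possible post-processing step, sketched here in base R with the hypothetical variable name agg, could look like this:

```r
# Rebuild the example data from the question
df <- data.frame(
  col1 = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  col2 = c("z","z","x","x","z","x","z","z","x","x","z","x"),
  col3 = c(1,2,3,4,5,6,7,8,9,10,11,12)
)

# Grouped mean; the value column keeps the name "col3"
agg <- aggregate(col3 ~ col1 + col2, data = df, FUN = mean)

# Rename the value column to match the desired output
names(agg)[names(agg) == "col3"] <- "col3_mean"

# Sort by col1, then col2, and reset the row names
agg <- agg[order(agg$col1, agg$col2), ]
rownames(agg) <- NULL

# Round for display only; keep the unrounded values if they feed later steps
agg$col3_mean <- round(agg$col3_mean, 2)
agg
```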
Usually, by() (being the object-oriented wrapper to tapply()) is best for larger, more extensive data frame operations that you need to run iteratively over subsets. In fact, if you need multiple aggregates, by() then becomes useful:
pred <- function(subset_df){
  df <- data.frame(col1 = subset_df$col1[[1]],
                   col2 = subset_df$col2[[1]],
                   col3_mean = mean(subset_df$col3),
                   col3_sd = sd(subset_df$col3),
                   col3_median = median(subset_df$col3),
                   col3_min = min(subset_df$col3),
                   col3_max = max(subset_df$col3),
                   col3_sum = sum(subset_df$col3),
                   col3_25pct = quantile(subset_df$col3)[[2]],
                   col3_75pct = quantile(subset_df$col3)[[4]],
                   col3_IQR = IQR(subset_df$col3)
  )
  return(df)
}
data_sub_list <- by(df, df[,by_keys], pred)
data_sub <- do.call(rbind, data_sub_list)
# col1 col2 col3_mean col3_sd col3_median col3_min col3_max col3_sum col3_25pct col3_75pct col3_IQR
# 1 a x 4.333333 1.527525 4 3 6 13 3.5 5.0 1.5
# 2 b x 10.333333 1.527525 10 9 12 31 9.5 11.0 1.5
# 3 a z 2.666667 2.081666 2 1 5 8 1.5 3.5 2.0
# 4 b z 8.666667 2.081666 8 7 11 26 7.5 9.5 2.0
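The same split-apply-combine pattern can also be written out by hand with split() and lapply(), which makes each step that by() performs explicit. This is only a sketch: summarise_group is a hypothetical, shortened stand-in for pred, and drop = TRUE is used on the assumption that empty key combinations should be skipped.

```r
# Rebuild the example data from the question
df <- data.frame(
  col1 = c("a","a","a","a","a","a","b","b","b","b","b","b"),
  col2 = c("z","z","x","x","z","x","z","z","x","x","z","x"),
  col3 = c(1,2,3,4,5,6,7,8,9,10,11,12)
)
by_keys <- c("col1", "col2")

# Hypothetical helper: one summary row per subset
summarise_group <- function(d) {
  data.frame(col1 = d$col1[[1]],
             col2 = d$col2[[1]],
             col3_mean = mean(d$col3),
             col3_sd = sd(d$col3))
}

# Split: one data frame per observed (col1, col2) combination
pieces <- split(df, df[by_keys], drop = TRUE)

# Apply, then combine the one-row results into a single data frame
res <- do.call(rbind, lapply(pieces, summarise_group))
res
```

Because lapply() always returns a list, this variant does not depend on whether the per-group result can be simplified.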
Upvotes: 3
Reputation: 9705
Use dplyr:
library(dplyr)
df %>%
  group_by(col1, col2) %>%
  summarize(col3_mean = mean(col3)) %>%
  as.data.frame()
col1 col2 col3_mean
1 a x 4.333
2 a z 2.667
3 b x 10.333
4 b z 8.667
Upvotes: 0