user9195416

Split-apply-combine with aggregate: can the applied function accept multiple arguments that are specified variables of the original data?

Some context: On my quest to improve my R code, I'm trying to replace my for-loops with R's apply-family functions whenever I can.

The question: Are R's apply functions such as sapply, tapply, aggregate, etc. useful for applying functions that are more complicated in the sense that they take as arguments specified variables of the original data?

Simple examples of what works and what does not: I have a dataframe with one time variable date.time and two numeric variables value.one and value.two:

Generate the data:

library(lubridate)  # provides ymd_hms()

df <- data.frame(
    date.time = seq(ymd_hms("2000-01-01 00:00:00"), ymd_hms("2000-01-03 00:00:00"), length.out = 100),
    value.one = 1:100,
    value.two = 1:100 + 10)

I would like to apply a function to every 10-hour cut of the dataframe, where the function takes the two numeric variables of the dataframe as its two arguments. For example, if I want to compute the mean of each of the two values for each 10-hour cut, the solution is the following:

A function that computes the mean of value.one and value.two for each time period of 10 hours:

work_on_subsets <- function(data, time.step = "10 hours"){
    aggregate(data[,-1], list(cut(df$date.time, breaks = time.step)), function(x) mean(x))}

However, if I want to work with the two data values separately to run another function, say compute the sum of the two averages, I run into trouble. The function work_on_subsets_2 gives me the following error: Error in x$value.one : $ operator is invalid for atomic vectors

A function that computes the sum of the means of value.one and value.two for each 10 hour time period:

work_on_subsets_2 <- function(data, time.step = "10 hours"){
    aggregate(data, list(cut(df$date.time, breaks = time.step)), function(x) mean(x$value.one) + mean(x$value.two)}

In the limit, I would like to be able to do something like this:

A function that runs another_function on value.one and value.two for each time period of 10 hours :

another_function <- function(a,b) {
    # do something with a and b
}
work_on_subsets_3 <- function(data, time.step = "10 hours"){aggregate(data, list(cut(df$date.time, breaks = time.step)), another_function(x$value.one, x$value.two))}

Am I using the wrong tools for this job? I already have a working solution using for-loops, but I'm trying to get a grip on the split-apply-combine strategy. If so, are there any viable alternatives to for-loops?

Upvotes: 1

Views: 227

Answers (1)

Sarah

Reputation: 3519

Hi, there are a few basic things you are doing wrong here. You are creating a function that takes data as its data.frame argument, but inside it you are still referencing df from the global environment. You're also missing at least one bracket. And I don't quite know why you have two layers of functions embedded.
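To make the error message concrete: aggregate() on a data frame calls FUN once per column, so x inside FUN is a plain numeric vector and x$value.one cannot work. Below is a minimal base R sketch that hands each 10-hour chunk to the function as a whole data frame instead. It assumes the df from the question (rebuilt here with as.POSIXct and tz = "UTC" so it runs without lubridate), and sum_of_means is just an illustrative name, not something from the question.

```r
# Rebuild the example data with base R only (the question uses lubridate::ymd_hms).
df <- data.frame(
  date.time = seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2000-01-03 00:00:00", tz = "UTC"),
                  length.out = 100),
  value.one = 1:100,
  value.two = 1:100 + 10
)

# Hypothetical helper: it only uses its argument `data`, never the global `df`.
sum_of_means <- function(data, time.step = "10 hours") {
  groups <- cut(data$date.time, breaks = time.step)
  # split() yields one data frame per period, so `$` works inside the function
  chunks <- split(data, groups)
  sapply(chunks, function(x) mean(x$value.one) + mean(x$value.two))
}

res <- sum_of_means(df)
```

res is a named numeric vector with one entry per 10-hour period. This stays in base R; whether it reads better than a for-loop is a matter of taste.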

My solution departs from your method but hopefully will help. I'd recommend using the plyr package when you want to split dataframes and apply functions, as I find it much more intuitive. Combining it with dplyr also helps in my opinion. Note: Always load plyr before dplyr, or you run into name conflicts, because both packages define functions such as mutate and summarise and the one loaded last masks the other.

If I understand your question correctly, the below should work, and you could create different functions to apply:

library(plyr)
library(dplyr)

# create the function you want to apply; it receives one chunk of the data frame
MeanFun <- function(data) mean(data[["value.one"]]) + mean(data[["value.two"]])

# add a grouping variable to your dataframe. You can link this with pipes (%>%)
# if you don't want to create a new data.frame, but for testing purposes it
# more clearly shows what's happening
time.step <- "10 hours"  # same period as in the question
df1 <- df %>% mutate(Breaks = cut(date.time, breaks = time.step))
# use plyr's ddply to split the dataframe on the "Breaks" column and apply the function
out <- ddply(df1, "Breaks", MeanFun)
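
The same ddply pattern should extend to the two-argument another_function from the question: wrap it in an anonymous function that pulls the two columns out of each chunk. This is only a sketch under assumptions. The body of another_function below (a ratio of means) is made up so the example runs, chunk and result are illustrative names, and the data is rebuilt with base R so the block is self-contained apart from plyr.

```r
library(plyr)

# Example data as in the question, built with base R instead of lubridate.
df <- data.frame(
  date.time = seq(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"),
                  as.POSIXct("2000-01-03 00:00:00", tz = "UTC"),
                  length.out = 100),
  value.one = 1:100,
  value.two = 1:100 + 10
)

# Assumed stand-in for the question's another_function(a, b).
another_function <- function(a, b) mean(a) / mean(b)

work_on_subsets_3 <- function(data, time.step = "10 hours") {
  # group on the function's own argument, not on a global data frame
  data$Breaks <- cut(data$date.time, breaks = time.step)
  ddply(data, "Breaks", function(chunk) {
    # each chunk is a data frame, so both columns are available by name
    data.frame(result = another_function(chunk$value.one, chunk$value.two))
  })
}

out3 <- work_on_subsets_3(df)
```

out3 has one row per 10-hour period, with the period start in Breaks and the function's value in result.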

Upvotes: 0
