Reputation: 715
I have written the following function in R to calculate, for each date, the two-day mean of VAR1 and of VAR2 (the value on that date averaged with the value on the previous day). The dataframe has the columns DATE (YYYY-MM-DD), ID, VAR1, and VAR2, and there are no missing dates.
df <- data.frame
TWODAY <- function(df){
  df$TWODAY_VAR1 <- NA
  for(j in 2:length(df$VAR1)){
    df$TWODAY_VAR1[j] <- mean(df$VAR1[j:(j-1)])
  }
  df$TWODAY_VAR2 <- NA
  for(j in 2:length(df$VAR2)){
    df$TWODAY_VAR2[j] <- mean(df$VAR2[j:(j-1)])
  }
  return(df)
}
I then applied this function to my dataframe with ddply:
df <- ddply(df, "ID", TWODAY)
However, my dataframe consists of over 13,000,000 observations, and this is running very slowly. Does anyone have any recommendations for how I could edit my code to make it more efficient?
Any advice would be greatly appreciated!
Upvotes: 1
Views: 146
Reputation: 28339
Solution using rowMeans:
nRow <- 13e6
df <- data.frame(VAR1 = rnorm(nRow),
VAR2 = rnorm(nRow))
df$TWODAY_VAR1 <- rowMeans(cbind(df$VAR1, c(NA, df$VAR1[-nrow(df)])))
df$TWODAY_VAR2 <- rowMeans(cbind(df$VAR2, c(NA, df$VAR2[-nrow(df)])))
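cbind each vector with its lagged copy, cbind(df$VAR1, c(NA, df$VAR1[-nrow(df)])) (NA for the first row, which has no previous day), and apply rowMeans to get the mean of each value and the one before it.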
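Note that the snippet above works on the whole column, so the first row of each ID would be averaged with the last row of the previous ID. Here is a minimal sketch of one way to respect the ID grouping from the question while staying vectorized. It assumes the rows can be sorted by ID and then DATE; the toy data and the names toy, prev_var1, prev_id and first_of_id are made up for illustration.

```r
# Toy data (made up): three IDs, one of them with a single row.
toy <- data.frame(
  ID   = c("a", "a", "a", "b", "b", "c"),
  DATE = as.Date("2020-01-01") + c(0, 1, 2, 0, 1, 0),
  VAR1 = c(1, 3, 5, 10, 20, 7),
  stringsAsFactors = FALSE
)
toy <- toy[order(toy$ID, toy$DATE), ]

# Lagged copies of the value and of the ID (NA for the very first row).
prev_var1 <- c(NA, toy$VAR1[-nrow(toy)])
prev_id   <- c(NA, toy$ID[-nrow(toy)])

# TRUE where a row starts a new ID and therefore has no previous day.
first_of_id <- is.na(prev_id) | prev_id != toy$ID

# Two-day mean, then blank out the rows that would mix two different IDs.
toy$TWODAY_VAR1 <- rowMeans(cbind(toy$VAR1, prev_var1))
toy$TWODAY_VAR1[first_of_id] <- NA
```

If ID is stored as a factor, converting it with as.character() before building prev_id is probably the safer choice.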
Upvotes: 2
Reputation: 6685
A manual vectorization:
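FOO <- function(x){
  # Mean of each element and its predecessor; the first element gets NA.
  n <- length(x)
  if (n < 2) return(rep(NA_real_, n))  # guard: 2:length(x) misbehaves for n < 2
  c(NA, (x[-1] + x[-n]) / 2)
}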
Example:
set.seed(123)
df <- data.frame(VAR1 = rnorm(10000), VAR2 = runif(10000))
> head(df)
         VAR1      VAR2
1 -0.56047565 0.9911234
2 -0.23017749 0.3022307
3  1.55870831 0.4337590
4  0.07050839 0.1605209
5  0.12928774 0.8230267
6  1.71506499 0.2080906
df$TWODAY_VAR1 <- FOO(df$VAR1)
df$TWODAY_VAR2 <- FOO(df$VAR2)
> head(df)
         VAR1      VAR2 TWODAY_VAR1 TWODAY_VAR2
1 -0.56047565 0.9911234          NA          NA
2 -0.23017749 0.3022307 -0.39532657   0.6466770
3  1.55870831 0.4337590  0.66426541   0.3679948
4  0.07050839 0.1605209  0.81460835   0.2971400
5  0.12928774 0.8230267  0.09989806   0.4917738
6  1.71506499 0.2080906  0.92217636   0.5155586
This should be pretty fast even with 13 million rows. One million rows takes a fraction of a second for me.
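Since the question applies the calculation per ID with ddply, the same shifted-sum idea can be run per group with base R's ave(). This is only a sketch: it assumes the rows are in DATE order within each ID, the toy data is made up, and two_day_mean is a hypothetical helper name with a guard for IDs that have a single row.

```r
# Hypothetical helper: mean of each element and its predecessor, NA first.
two_day_mean <- function(x) {
  n <- length(x)
  if (n < 2) return(rep(NA_real_, n))  # an ID with one row has no previous day
  c(NA, (x[-1] + x[-n]) / 2)
}

# Toy data (made up), already ordered by ID and day.
toy <- data.frame(
  ID   = c("a", "a", "a", "b", "b", "c"),
  VAR1 = c(1, 3, 5, 10, 20, 7),
  VAR2 = c(2, 4, 8, 1, 3, 9)
)

# ave() splits the vector by ID, applies the helper to each piece,
# and returns the results in the original row positions.
toy$TWODAY_VAR1 <- ave(toy$VAR1, toy$ID, FUN = two_day_mean)
toy$TWODAY_VAR2 <- ave(toy$VAR2, toy$ID, FUN = two_day_mean)
```

With a very large number of distinct IDs the per-group function calls add overhead, so this may be slower than a single pass over the whole column.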
Benchmark for a single variable with 13,000,000 rows:
> b
Unit: seconds
expr min lq mean median uq max neval
df$TWODAY_VAR1 <- FOO(df$VAR1) 0.182657 0.209106 0.2308234 0.2175971 0.2239455 0.3119504 10
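The call that produced the table above is not shown. A rough, dependency-free way to get a comparable timing is base R's system.time(). The sketch below also checks the vectorized result against a single-vector version of the loop from the question on a small sample. The names two_day_mean, two_day_loop and n_rows are made up, n_rows is set to one million rather than 13 million, and the timing will depend on the machine.

```r
# Hypothetical vectorized helper, same shifted-sum idea as above.
two_day_mean <- function(x) {
  n <- length(x)
  if (n < 2) return(rep(NA_real_, n))
  c(NA, (x[-1] + x[-n]) / 2)
}

# The loop from the question, reduced to one vector for comparison.
two_day_loop <- function(x) {
  out <- rep(NA_real_, length(x))
  for (j in seq_along(x)[-1]) out[j] <- mean(x[(j - 1):j])
  out
}

set.seed(123)
small <- rnorm(1e4)
# Both versions should agree up to floating-point tolerance.
agree <- isTRUE(all.equal(two_day_mean(small), two_day_loop(small)))

n_rows <- 1e6                # raise to 13e6 to mirror the question
big    <- rnorm(n_rows)
timing <- system.time(res <- two_day_mean(big))
print(timing[["elapsed"]])   # elapsed wall-clock seconds
```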
Upvotes: 3