Reputation: 10473
I have a data frame such as this:
> bp
Source: local data frame [6 x 4]
        date amount accountId type
1 2015-06-11  101.2         1    a
2 2015-06-18  101.2         1    a
3 2015-06-24  101.2         1    b
4 2015-06-11  294.0         2    a
5 2015-06-18   48.0         2    a
6 2015-06-26   10.0         2    b
It has 3.4 million rows of data:
> nrow(bp)
[1] 3391874
>
I am trying to compute lagged differences of the dates, in days, grouped by account, using dplyr:
bp <- bp %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
On my MacBook with 8 GB of memory, R crashes. On a 64 GB Linux server the code takes forever. Any ideas on how to fix this?
Upvotes: 2
Views: 944
Reputation: 93813
No idea what has gone wrong over your way, but with date as a proper Date object, everything goes very quickly over here:
Recreate some data:
dat <- read.table(text=" date amount accountId type
1 2015-06-11 101.2 1 a
2 2015-06-18 101.2 1 a
3 2015-06-24 101.2 1 b
4 2015-06-11 294.0 2 a
5 2015-06-18 48.0 2 a
6 2015-06-26 10.0 2 b",header=TRUE)
dat$date <- as.Date(dat$date)
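Before grouping, it's worth confirming the column really is a Date vector; a character or factor date column is a common cause of exactly this kind of slowdown. A minimal check (a sketch, not part of the original answer):

```r
# Verify the date column is a true Date vector; subtraction on Date
# objects yields a difftime in days, which as.numeric() converts cleanly.
stopifnot(inherits(dat$date, "Date"))

as.numeric(as.Date("2015-06-18") - as.Date("2015-06-11"))
# 7
```

If the check fails, running dat$date <- as.Date(dat$date) first (as above) fixes it.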
Then run some analyses on 3.4M rows, 1000 groups:
set.seed(1)
dat2 <- dat[sample(rownames(dat),3.4e6,replace=TRUE),]
dat2$accountId <- sample(1:1000,3.4e6,replace=TRUE)
nrow(dat2)
#[1] 3400000
length(unique(dat2$accountId))
#[1] 1000
system.time({
dat2 <- dat2 %>% group_by(accountId) %>%
mutate(diff = as.numeric(date - lag(date)))
})
# user system elapsed
# 0.38 0.03 0.40
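For comparison, the same grouped lag difference can be computed in base R with ave() and diff(); this is a sketch of an alternative, not the method used above:

```r
# Base-R equivalent of the grouped lag difference: convert the Date
# column to its numeric day count, then take within-group differences,
# padding each group's first row with NA (matching dplyr's lag()).
dat2$diff2 <- ave(as.numeric(dat2$date), dat2$accountId,
                  FUN = function(x) c(NA, diff(x)))
```

Note that ave() computes differences in the data's current row order within each group, just as group_by() + lag() does, so the two columns agree.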
head(dat2[dat2$accountId==46,])
#Source: local data frame [6 x 6]
#Groups: accountId
#
# date amount accountId type diff
#1 2015-06-24 101.2 46 b NA
#2 2015-06-18 48.0 46 a -6
#3 2015-06-11 294.0 46 a -13
#4 2015-06-18 101.2 46 a 7
#5 2015-06-26 10.0 46 b 2
#6 2015-06-11 294.0 46 a 0
Upvotes: 4