Gopala

Reputation: 10473

R dplyr does not finish lagged date difference computation

I have a data frame such as this:

> bp
Source: local data frame [6 x 4]

        date amount accountId type
1 2015-06-11  101.2         1    a
2 2015-06-18  101.2         1    a
3 2015-06-24  101.2         1    b
4 2015-06-11  294.0         2    a
5 2015-06-18   48.0         2    a
6 2015-06-26   10.0         2    b

It has 3.4 million rows of data:

> nrow(bp)
[1] 3391874

I am trying to compute the lagged difference in days between consecutive dates within each account, using dplyr:

bp <- bp %>% group_by(accountId) %>%
  mutate(diff = as.numeric(date - lag(date)))

On my MacBook with 8 GB of memory, R crashes. On a Linux server with 64 GB, the code runs for so long that it never finishes. Any ideas on how to fix this?

Upvotes: 2

Views: 944

Answers (1)

thelatemail

Reputation: 93813

I can't tell what has gone wrong on your end, but with date stored as a proper Date object, the same code runs very quickly for me.

Recreate some data:

dat <- read.table(text="        date amount accountId type
1 2015-06-11  101.2         1    a
2 2015-06-18  101.2         1    a
3 2015-06-24  101.2         1    b
4 2015-06-11  294.0         2    a
5 2015-06-18   48.0         2    a
6 2015-06-26   10.0         2    b",header=TRUE)
dat$date <- as.Date(dat$date)
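
Since the timing below relies on date being a Date, it may be worth checking the column's class in the original data before anything else. A minimal sketch, assuming date came in as character; the two-row bp here is a made-up stand-in for illustration, not the real data:

```r
# Hypothetical stand-in for the question's `bp`, with date stored as character
bp <- data.frame(date = c("2015-06-11", "2015-06-18"),
                 stringsAsFactors = FALSE)

class(bp$date)  # inspect how the column is actually stored

# Convert only if it is not already a Date; as.character() also covers factors
if (!inherits(bp$date, "Date")) {
  bp$date <- as.Date(as.character(bp$date))
}
```

Whether this is the actual cause of the slowdown is a guess on my part; it is just the first thing I would rule out.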

Then scale it up to 3.4M rows in 1000 groups and time the same computation:

set.seed(1)
dat2 <- dat[sample(rownames(dat),3.4e6,replace=TRUE),]
dat2$accountId <- sample(1:1000,3.4e6,replace=TRUE)
nrow(dat2)
#[1] 3400000
length(unique(dat2$accountId))
#[1] 1000

library(dplyr)

# Time the grouped lagged date difference on the 3.4M-row data
system.time({
  dat2 <- dat2 %>% group_by(accountId) %>%
    mutate(diff = as.numeric(date - lag(date)))
})
#  user  system elapsed 
#  0.38    0.03    0.40 

head(dat2[dat2$accountId==46,])
#Source: local data frame [6 x 6]
#Groups: accountId
#
#        date amount accountId type diff
#1 2015-06-24  101.2        46    b   NA
#2 2015-06-18   48.0        46    a   -6
#3 2015-06-11  294.0        46    a  -13
#4 2015-06-18  101.2        46    a    7
#5 2015-06-26   10.0        46    b    2
#6 2015-06-11  294.0        46    a    0
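
Note that the rows in dat2 are in random order, which is why some of the diffs above are negative. If the goal is the gap between consecutive dates within an account, one option is to sort by date before taking the lag. A minimal sketch on a small made-up data frame (toy and res are illustrative names, not from the question):

```r
library(dplyr)

# Made-up example data, deliberately out of date order
toy <- data.frame(
  date = as.Date(c("2015-06-24", "2015-06-11", "2015-06-18",
                   "2015-06-26", "2015-06-11")),
  accountId = c(1, 1, 1, 2, 2)
)

res <- toy %>%
  arrange(accountId, date) %>%                      # sort within each account
  group_by(accountId) %>%
  mutate(diff = as.numeric(date - lag(date))) %>%   # days since previous row
  ungroup()

res$diff
# NA 7 6 NA 15
```

The first row of each account has no previous date, so its diff is NA.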

Upvotes: 4
