Reputation: 6778
I've got a data set, called vistsPerDay, that looks like this but with 405,890 rows and 10,406 unique CUST_ID:
> CUST_ID Date
> 1 2013-09-19
> 1 2013-10-03
> 1 2013-10-08
> 1 2013-10-12
> 1 2013-10-20
> 1 2013-10-25
> 1 2013-11-01
> 1 2013-11-02
> 1 2013-11-08
> 1 2013-11-15
> 1 2013-11-23
> 1 2013-12-02
> 1 2013-12-04
> 1 2013-12-09
> 2 2013-09-16
> 2 2013-09-17
> 2 2013-09-18
What I'd like to do is create a new variable that is the lagged difference between the dates in their visits. Here is the code I'm currently using:
visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
cust_id <- 0
for (i in 1:nrow(visitsPerDay)) {
if (visitsPerDay$CUST_ID[i] != cust_id) {
cust_id <- visitsPerDay$CUST_ID[i]
visitsPerDay$MTBV <- NA
} else {
visitsPerDay$MBTV <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i-1])
}
}
I feel like this is certainly not the most efficient way to do this. Does anyone have a better way to approach it? Thanks!
Upvotes: 0
Views: 167
Reputation: 49448
Here's the data.table
solution. This will likely be much faster and is more readable:
dt = data.table(visitsPerDay)
dt[, MBTV := c(NA, diff(as.Date(Date))), by = CUST_ID]
dt
# CUST_ID Date MBTV
# 1: 1 2013-09-19 NA days
# 2: 1 2013-10-03 14 days
# 3: 1 2013-10-08 5 days
# 4: 1 2013-10-12 4 days
# 5: 1 2013-10-20 8 days
# 6: 1 2013-10-25 5 days
# 7: 1 2013-11-01 7 days
# 8: 1 2013-11-02 1 days
# 9: 1 2013-11-08 6 days
#10: 1 2013-11-15 7 days
#11: 1 2013-11-23 8 days
#12: 1 2013-12-02 9 days
#13: 1 2013-12-04 2 days
#14: 1 2013-12-09 5 days
#15: 2 2013-09-16 NA days
#16: 2 2013-09-17 1 days
#17: 2 2013-09-18 1 days
Upvotes: 1
Reputation: 81693
Here's an approach with tapply
:
# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
visitsPerDay <- transform(visitsPerDay,
MBTV = unlist(tapply(Date,
CUST_ID,
FUN = function(x) c(NA,diff(x)))))
The result:
CUST_ID Date MBTV
11 1 2013-09-19 NA
12 1 2013-10-03 14
13 1 2013-10-08 5
14 1 2013-10-12 4
15 1 2013-10-20 8
16 1 2013-10-25 5
17 1 2013-11-01 7
18 1 2013-11-02 1
19 1 2013-11-08 6
110 1 2013-11-15 7
111 1 2013-11-23 8
112 1 2013-12-02 9
113 1 2013-12-04 2
114 1 2013-12-09 5
21 2 2013-09-16 NA
22 2 2013-09-17 1
23 2 2013-09-18 1
Edit: A faster approach:
# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date)
visitsPerDay$MBTV <- c(NA_integer_,
"is.na<-"(diff(visitsPerDay$Date),
!duplicated(visitsPerDay$CUST_ID)[-1]))
Upvotes: 0
Reputation: 178461
You can accelerate the process by doing a bucket sort rather than ordinary sort, since you are sorting by the cust_id. Note that the bottleneck in the algorithm (in terms of big O notation) is the sorting, which is O(nlogn)
.
The following pseudo code assumes the data is given sorted by date (same assumption you need for the suggested code in the answer):
//bucket sort:
customers <- new array of size 10406
for each (cust_id,date):
if customers[cust_id] == nil:
customers[cust_id] = []
customers[cust_id].append(date)
//find differences:
for each list in customers:
i <- list.iter()
prev = i.next()
while (i.hasNext()):
curr <- i.next()
output diff(prev,curr)
prev <- curr
The above code runs in O(n)
, which is theoretically better than your approach (for large enough inputs), at the cost of more memory consumption.
Upvotes: 0