tblznbits
tblznbits

Reputation: 6778

Algorithm Efficiency - Time Difference Loop

I've got a data set, called vistsPerDay, that looks like this but with 405,890 rows and 10,406 unique CUST_ID:

> CUST_ID   Date
> 1         2013-09-19
> 1         2013-10-03
> 1         2013-10-08
> 1         2013-10-12
> 1         2013-10-20
> 1         2013-10-25
> 1         2013-11-01
> 1         2013-11-02
> 1         2013-11-08
> 1         2013-11-15
> 1         2013-11-23
> 1         2013-12-02
> 1         2013-12-04
> 1         2013-12-09
> 2         2013-09-16
> 2         2013-09-17
> 2         2013-09-18

What I'd like to do is create a new variable that is the lagged difference between the dates in their visits. Here is the code I'm currently using:

visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
cust_id <- 0
for (i in 1:nrow(visitsPerDay)) {
  if (visitsPerDay$CUST_ID[i] != cust_id) {
    cust_id <- visitsPerDay$CUST_ID[i]
    visitsPerDay$MTBV <- NA
  } else {
    visitsPerDay$MBTV <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i-1])
  }
}

I feel like this is certainly not the most efficient way to do this. Does anyone have a better way to approach it? Thanks!

Upvotes: 0

Views: 167

Answers (3)

eddi
eddi

Reputation: 49448

Here's the data.table solution. This will likely be much faster and is more readable:

dt = data.table(visitsPerDay)

dt[, MBTV := c(NA, diff(as.Date(Date))), by = CUST_ID]
dt
#    CUST_ID       Date    MBTV
# 1:       1 2013-09-19 NA days
# 2:       1 2013-10-03 14 days
# 3:       1 2013-10-08  5 days
# 4:       1 2013-10-12  4 days
# 5:       1 2013-10-20  8 days
# 6:       1 2013-10-25  5 days
# 7:       1 2013-11-01  7 days
# 8:       1 2013-11-02  1 days
# 9:       1 2013-11-08  6 days
#10:       1 2013-11-15  7 days
#11:       1 2013-11-23  8 days
#12:       1 2013-12-02  9 days
#13:       1 2013-12-04  2 days
#14:       1 2013-12-09  5 days
#15:       2 2013-09-16 NA days
#16:       2 2013-09-17  1 days
#17:       2 2013-09-18  1 days

Upvotes: 1

Sven Hohenstein
Sven Hohenstein

Reputation: 81693

Here's an approach with tapply:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay <- transform(visitsPerDay, 
                          MBTV = unlist(tapply(Date, 
                                               CUST_ID, 
                                               FUN = function(x) c(NA,diff(x)))))

The result:

    CUST_ID       Date MBTV
11        1 2013-09-19   NA
12        1 2013-10-03   14
13        1 2013-10-08    5
14        1 2013-10-12    4
15        1 2013-10-20    8
16        1 2013-10-25    5
17        1 2013-11-01    7
18        1 2013-11-02    1
19        1 2013-11-08    6
110       1 2013-11-15    7
111       1 2013-11-23    8
112       1 2013-12-02    9
113       1 2013-12-04    2
114       1 2013-12-09    5
21        2 2013-09-16   NA
22        2 2013-09-17    1
23        2 2013-09-18    1

Edit: A faster approach:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay$MBTV <- c(NA_integer_, 
                       "is.na<-"(diff(visitsPerDay$Date), 
                                 !duplicated(visitsPerDay$CUST_ID)[-1]))

Upvotes: 0

amit
amit

Reputation: 178461

You can accelerate the process by doing a bucket sort rather than ordinary sort, since you are sorting by the cust_id. Note that the bottleneck in the algorithm (in terms of big O notation) is the sorting, which is O(nlogn).

The following pseudo code assumes the data is given sorted by date (same assumption you need for the suggested code in the answer):

//bucket sort:
customers <- new array of size 10406
for each (cust_id,date):
   if customers[cust_id] == nil:
        customers[cust_id] = []
   customers[cust_id].append(date)
//find differences:
for each list in customers:
   i <- list.iter()
   prev = i.next()
   while (i.hasNext()):
        curr <- i.next()
        output diff(prev,curr)
        prev <- curr

The above code runs in O(n), which is theoretically better than your approach (for large enough inputs), at the cost of more memory consumption.

Upvotes: 0

Related Questions