Reputation: 171

Calculations within subsets of dataframe [R]

I am having difficulty with calculations within subsets of a data frame. I can get overall statistics, such as the average purchase per customer (a factor), using ave, tapply, or ddply, but I cannot work out how to calculate visit-by-visit statistics for each customer. The simplified data below illustrates my data and the results I want.

Current Dataframe: (Note that visit #1 is the most recent visit)

  customer  visit      date    purchase_amt
    sarah          2    2013-08-09      5
    sarah          3    2013-07-21      8
    sarah          4    2013-06-23      9
    sarah          5    2013-06-02      1
    sarah          1    2013-08-20      8
    henry          1    2013-07-04      4
    che            1    2013-08-27      2
    che            2    2013-07-27      1
    che            3    2013-07-05      8
    che            4    2013-06-14      3
    dt             3    2013-04-05      9
    dt             2    2013-06-07      1
    dt             1    2013-07-11      6
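
To make the example reproducible, here is a minimal sketch that rebuilds the table above as a data frame. The name `DF` is an assumption (chosen to match the first answer below), and the dates are converted to class `Date` so that differences come out in days.

```r
# Rebuild the sample data from the table above.
# The name DF is an assumption; it matches the name used in the first answer.
DF <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
customer visit date purchase_amt
sarah 2 2013-08-09 5
sarah 3 2013-07-21 8
sarah 4 2013-06-23 9
sarah 5 2013-06-02 1
sarah 1 2013-08-20 8
henry 1 2013-07-04 4
che 1 2013-08-27 2
che 2 2013-07-27 1
che 3 2013-07-05 8
che 4 2013-06-14 3
dt 3 2013-04-05 9
dt 2 2013-06-07 1
dt 1 2013-07-11 6
")

# read.table() reads the dates as text; convert them to class 'Date'
# so that diff() on the column returns a number of days.
DF$date <- as.Date(DF$date)
str(DF)
```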

These are the results I am seeking:

customer  visit  date        purchase_amt  days_since  amt_diff
sarah     2      2013-08-09  5             19          -3
sarah     3      2013-07-21  8             28          -1
sarah     4      2013-06-23  9             21           8
sarah     5      2013-06-02  1             NA          NA
sarah     1      2013-08-20  8             11           3
henry     1      2013-07-04  4             NA          NA
che       1      2013-08-27  2             31           1
che       2      2013-07-27  1             22          -7
che       3      2013-07-05  8             21           5
che       4      2013-06-14  3             NA          NA
dt        3      2013-04-05  9             NA          NA
dt        2      2013-06-07  1             63          -8
dt        1      2013-07-11  6             34           5

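In summary: for each visit of a customer, starting with the most recent, I would like to find the customer's previous visit (the one with the next-earlier date) and calculate various statistics on the pair, such as the days between them and the difference in purchase amount. The result should be NA when a visit has no previous visit.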

Upvotes: 2

Answers (3)

G. Grothendieck

Reputation: 269852

This solution uses only base R and retains the original row order of the input (it assumes DF$date is of class "Date"):

# Sort, calculate differences and unsort.
# r is the row indexes of one group, order.by is the vector to sort by,
# col is the vector to difference

diffs <- function(r, order.by, col) {
    order.by <- order.by[r]   # restrict both vectors to this group's rows
    col <- col[r]
    o <- order(order.by)      # positions from earliest to latest
    # difference the sorted values, then put each result back in its original position
    replace(r, o, c(NA, diff(col[o])))
}

# curry returns a one-argument version of fun: the row indexes r are supplied
# later by ave, and the remaining arguments are fixed now

curry <- function (fun, ...) function(r) fun(r, ...)

ix <- seq_len(nrow(DF))
transform(DF, 
    days_since = ave(ix, customer, FUN = curry(diffs, date, date)),
    amt_diff = ave(ix, customer, FUN = curry(diffs, date, purchase_amt))
)

The result is:

   customer visit       date purchase_amt days_since amt_diff
1     sarah     2 2013-08-09            5         19       -3
2     sarah     3 2013-07-21            8         28       -1
3     sarah     4 2013-06-23            9         21        8
4     sarah     5 2013-06-02            1         NA       NA
5     sarah     1 2013-08-20            8         11        3
6     henry     1 2013-07-04            4         NA       NA
7       che     1 2013-08-27            2         31        1
8       che     2 2013-07-27            1         22       -7
9       che     3 2013-07-05            8         21        5
10      che     4 2013-06-14            3         NA       NA
11       dt     3 2013-04-05            9         NA       NA
12       dt     2 2013-06-07            1         63       -8
13       dt     1 2013-07-11            6         34        5
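
To see why the sort/difference/unsort step keeps the original order, here is a small worked sketch on sarah's five rows only. It is an illustration of the idea inside `diffs`, not the answer's code itself; the variable names `o`, `sorted_amt_diff` and `sorted_days` are made up for this example.

```r
# Sarah's rows, in the order they appear in the question.
date <- as.Date(c("2013-08-09", "2013-07-21", "2013-06-23",
                  "2013-06-02", "2013-08-20"))
purchase_amt <- c(5, 8, 9, 1, 8)

# Positions of the rows from the earliest to the latest visit.
o <- order(date)

# Differences between consecutive visits in date order, padded with a
# leading NA for the earliest visit, which has no previous visit.
sorted_amt_diff <- c(NA, diff(purchase_amt[o]))
sorted_days     <- c(NA, as.numeric(diff(date[o])))

# replace() writes each value back to the position its row had before
# sorting, which is what preserves the original row order.
amt_diff   <- replace(rep(NA_real_, length(o)), o, sorted_amt_diff)
days_since <- replace(rep(NA_real_, length(o)), o, sorted_days)

data.frame(date, purchase_amt, days_since, amt_diff)
```

The `days_since` and `amt_diff` columns produced here agree with rows 1 to 5 of the result shown above.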

UPDATE: minor improvements to code.

Upvotes: 6

Metrics

Reputation: 15458

Here is the data.table solution in line with @Henrik:

    df<-structure(list(customer = structure(c(4L, 4L, 4L, 4L, 4L, 3L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("che", "dt", "henry", 
"sarah"), class = "factor"), visit = c(2L, 3L, 4L, 5L, 1L, 1L, 
1L, 2L, 3L, 4L, 3L, 2L, 1L), date = structure(c(15926, 15907, 
15879, 15858, 15937, 15890, 15944, 15913, 15891, 15870, 15800, 
15863, 15897), class = "Date"), purchase_amt = c(5L, 8L, 9L, 
1L, 8L, 4L, 2L, 1L, 8L, 3L, 9L, 1L, 6L)), .Names = c("customer", 
"visit", "date", "purchase_amt"), row.names = c(NA, -13L), class =  
"data.frame")

library(data.table)
df <- data.table(df)
# sort by customer and date first, so that diff() compares each visit with the previous one
setkey(df, customer, date)
df[, list(visit = visit, date = date, purchase_amt = purchase_amt, days_since = c(NA, diff(date)), amt_diff = c(NA, diff(purchase_amt))), keyby = "customer"]
    customer visit       date purchase_amt days_since amt_diff
 1:      che     4 2013-06-14            3         NA       NA
 2:      che     3 2013-07-05            8         21        5
 3:      che     2 2013-07-27            1         22       -7
 4:      che     1 2013-08-27            2         31        1
 5:       dt     3 2013-04-05            9         NA       NA
 6:       dt     2 2013-06-07            1         63       -8
 7:       dt     1 2013-07-11            6         34        5
 8:    henry     1 2013-07-04            4         NA       NA
 9:    sarah     5 2013-06-02            1         NA       NA
10:    sarah     4 2013-06-23            9         21        8
11:    sarah     3 2013-07-21            8         28       -1
12:    sarah     2 2013-08-09            5         19       -3
13:    sarah     1 2013-08-20            8         11        3

Upvotes: 5

Henrik

Reputation: 67778

Something like this? Assuming your data is called df:

library(plyr)

# convert dates to class 'Date'
df$date <- as.Date(df$date)

# order by customer and date
df <- df[order(df$customer, df$date), ]
# or since plyr is loaded anyway:
df <- arrange(df, customer, date) 

# per customer, calculate differences in date and purchase, between consecutive visits
# pad differences with a leading NA
df2 <- ddply(.data = df, .variables = .(customer), mutate,
      days_since = c(NA, diff(date)),
      amt_diff = c(NA, diff(purchase_amt)))

df2
# customer visit       date purchase_amt days_since amt_diff
# 1       che     4 2013-06-14            3         NA       NA
# 2       che     3 2013-07-05            8         21        5
# 3       che     2 2013-07-27            1         22       -7
# 4       che     1 2013-08-27            2         31        1
# 5        dt     3 2013-04-05            9         NA       NA
# 6        dt     2 2013-06-07            1         63       -8
# 7        dt     1 2013-07-11            6         34        5
# 8     henry     1 2013-07-04            4         NA       NA
# 9     sarah     5 2013-06-02            1         NA       NA
# 10    sarah     4 2013-06-23            9         21        8
# 11    sarah     3 2013-07-21            8         28       -1
# 12    sarah     2 2013-08-09            5         19       -3
# 13    sarah     1 2013-08-20            8         11        3
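
The output above is sorted by customer and date. If the original row order matters, one possible variation is to compute on a sorted copy and write the results back through the sort index. This is a base R sketch rather than plyr, shown on a small slice of the data; the helper name `lag_diff` and the variables `o` and `s` are made up for this example.

```r
# A small slice of the question's data, deliberately not sorted by date.
df <- data.frame(
  customer     = c("sarah", "sarah", "sarah", "sarah", "sarah", "henry"),
  visit        = c(2, 3, 4, 5, 1, 1),
  date         = as.Date(c("2013-08-09", "2013-07-21", "2013-06-23",
                           "2013-06-02", "2013-08-20", "2013-07-04")),
  purchase_amt = c(5, 8, 9, 1, 8, 4)
)

# Hypothetical helper: consecutive differences padded with a leading NA.
lag_diff <- function(x) c(NA, diff(as.numeric(x)))

o <- order(df$customer, df$date)  # row positions in customer/date order
s <- df[o, ]                      # sorted copy used for the calculation

# Start with empty columns, then assign through [o] so that each value
# lands on the row it came from in the unsorted data.
df$days_since <- NA_real_
df$amt_diff   <- NA_real_
df$days_since[o] <- ave(as.numeric(s$date), s$customer, FUN = lag_diff)
df$amt_diff[o]   <- ave(s$purchase_amt, s$customer, FUN = lag_diff)

df
```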

Upvotes: 7
