Reputation: 519

Subtract dates across DataFrames

I just start learning Python and R, so any advice using either of them would be much appreciated.

My data are stored in two dataframes. One is sales data, for each consumer, we can see the date when he purchases something. It is possible the same person purchases more than once:

Date             Person ID      Product       
01-05-2012       1              cereal
01-05-2012       2              apple
02-08-2012       3              beef
03-22-2013       72             pot
07-19-2012       1              cake

The second dataframe has membership data, which tell us when did a person enrolled in the program:

Date             Person ID      Type      Status      
06-11-2008       1              Gold      New
10-12-2011       2              Gold      New    
02-08-2011       3              Silver    Renewal
02-01-2012       72             Gold      Renewal
03-22-2012       1              Gold      Renewal

What I want to do is, for the same person, how long does it take before a person purchases something before he enrolls in a program.

For example, person 1 got a new membership on 06-11-2008 and purchased cereal on 01-05-2012. I would like to calculate how many days there are between these two dates.

However, these information are stored in separate dataframes. I don't think they can be append or merged into one dataframe, because one person can have more than one observations in one or both of the dataframes.

What I am thinking is, extract all the dates from sales data, and extract all the dates from the license data. Then merge these two new dataframes into a new dataframe. This will give me:

License Date     Person ID      Sales Date            
06-11-2008       1              01-05-2012      
10-12-2011       2              01-05-2012         
02-08-2011       3              02-08-2011    
02-01-2012       72             03-22-2013
06-11-2008       1              07-19-2012 
03-22-2012       1              01-05-2012
03-22-2012       1              07-19-2012

But the problem here is, if a person has two license dates (ex. one new and one renewal), then merge the data will give 2*(sales dates)... but I only want the sales dates for a license that is valid..

For example, person 1 used license 06-11-2008 to buy cereal on 01-05-2012, and used license 03-22-2012 to buy on 07-19-2012. But merging the dataframes will give me 4 records rather than the 2 I want...

The result I would want is the time to purchase for each sale, after he gets the license which he used for that purchase:

License Date     Person ID      Sales Date   TimeToPurchase         
06-11-2008       1              01-05-2012      ...
10-12-2011       2              01-05-2012      ...
02-08-2011       3              02-08-2011      ...
02-01-2012       72             03-22-2013      ...
03-22-2012       1              07-19-2012      ...

Is there a better way you suggest I can do?

Thank you very much for the help!

Upvotes: 3

Answers (3)

ismirsehregal

Reputation: 33417

Here is a solution using R and library(data.table) assuming you are only interested in the latest time to purchase:

Edit: after question was updated

library(data.table)

purchaseDT <- data.table(stringsAsFactors=FALSE,
                         Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
                         PersonID = c(1, 2, 1, 72),
                         Product = c("cereal", "apple", "beef", "pot")
)

programDT <- data.table(stringsAsFactors=FALSE,
                        Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
                        PersonID = c(1, 2, 1, 72),
                        Type = c("Gold", "Gold", "Silver", "Gold"),
                        Status = c("New", "New", "Renewal", "Renewal")
)

purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]

mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]

resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]

print(resultDT)

Result:

   LicenseDate PersonID TimeToPurchase PurchaseDate
1:  2008-06-11        1       208 days   2009-01-05
2:  2011-10-12        2        85 days   2012-01-05
3:  2011-02-08        1       365 days   2012-02-08
4:  2012-02-01       72       415 days   2013-03-22

Upvotes: 1

josemz

Reputation: 1312

pandas

First your dates need to be saved as datetime, which you can accomplish like this:

sales['Date'] = pd.to_datetime(sales['Date'])
memberships['Date'] = pd.to_datetime(memberships['Date'])

Then you merge them by Person ID and arrive to the format that has duplicates.

m = sales.merge(memberships, left_on='Person ID', right_on='Person ID',
                suffixes=('_sales', '_memberships'))
m

  Date_sales  Person ID Product Date_memberships    Type   Status
0 2012-01-05          1  cereal       2008-06-11    Gold      New
1 2012-01-05          1  cereal       2012-03-22    Gold  Renewal
2 2012-07-19          1    cake       2008-06-11    Gold      New
3 2012-07-19          1    cake       2012-03-22    Gold  Renewal
4 2012-01-05          2   apple       2011-10-12    Gold      New
5 2012-02-08          3    beef       2011-02-08  Silver  Renewal
6 2013-03-22         72     pot       2012-02-01    Gold  Renewal

Now you can calculate the elapsed days between the sales and the membership dates like this:

m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m

  Date_sales  Person ID Product Date_memberships    Type   Status  TimeToPurchase
0 2012-01-05          1  cereal       2008-06-11    Gold      New            1303
1 2012-01-05          1  cereal       2012-03-22    Gold  Renewal             -77
2 2012-07-19          1    cake       2008-06-11    Gold      New            1499
3 2012-07-19          1    cake       2012-03-22    Gold  Renewal             119
4 2012-01-05          2   apple       2011-10-12    Gold      New              85
5 2012-02-08          3    beef       2011-02-08  Silver  Renewal             365
6 2013-03-22         72     pot       2012-02-01    Gold  Renewal             415

From here you can first eliminate the negatives and then get the minimum TimeToPurchase for each Person ID and Date sales.

m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep

 Person ID Date_sales  TimeToPurchase
         1 2012-01-05            1303
         1 2012-07-19             119
         2 2012-01-05              85
         3 2012-02-08             365
        72 2013-03-22             415

These are the records that you want to keep in your merged table, which you can filter with an inner join:

result = m.merge(keep, 
                 left_on=['Person ID', 'Date_sales', 'TimeToPurchase'], 
                 right_on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result

Date_sales  Person ID Product Date_memberships    Type   Status  TimeToPurchase
2012-01-05          1  cereal       2008-06-11    Gold      New            1303
2012-07-19          1    cake       2012-03-22    Gold  Renewal             119
2012-01-05          2   apple       2011-10-12    Gold      New              85
2012-02-08          3    beef       2011-02-08  Silver  Renewal             365
2013-03-22         72     pot       2012-02-01    Gold  Renewal             415

data.table

Same logic as above, so I'll just paste the code.

# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]

# Merge
m <- memberships[sales, on = "Person ID"]

# Calculate elapsed days
m[, TimeToPurchase := as.numeric(m$i.Date - m$Date)]

# Eliminate negatives
m <- m[TimeToPurchase >= 0]

# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]

# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result

         Date Person ID   Type  Status     i.Date Product TimeToPurchase
1: 2008-06-11         1   Gold     New 2012-01-05  cereal           1303
2: 2011-10-12         2   Gold     New 2012-01-05   apple             85
3: 2011-02-08         3 Silver Renewal 2012-02-08    beef            365
4: 2012-02-01        72   Gold Renewal 2013-03-22     pot            415
5: 2012-03-22         1   Gold Renewal 2012-07-19    cake            119

Upvotes: 2

jazzurro

Reputation: 23574

Here is one idea for you. First, I merged the two data sets using Person_ID and Date. In this example, I needed to create a date object (i.e., Date) in the first mutate(). I sorted the data by Person_ID and Date. Then, I created a new grouping variable. What I did was to identify rows where Status is either "New" or "Renewal". This means that I identified when a license became valid for the first time. That row becomes the first row for each license. For each group, I chose the first two rows. The data is arranged by Person_ID and Date, so the 2nd row should be the one that a customer bought something with the valid license for the first time. Finally, I calculated the interval (i.e., time2purchase) using Date.

full_join(df1, df2, by = c("Person_ID", "Date")) %>%
mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
arrange(Person_ID, Date) %>%
mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
group_by(group) %>%
slice(1:2) %>%
summarize(time2purchase = Date[2]-Date[1])

  group time2purchase
  <int> <time>       
1     1 1303 days    
2     2  119 days    
3     3   85 days    
4     4  365 days    
5     5  415 days   

To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().

  Date       Person_ID Product Type   Status  group time2purchase
  <date>         <int> <chr>   <chr>  <chr>   <int> <time>       
1 2008-06-11         1 NA      Gold   New         1 1303 days    
2 2012-03-22         1 NA      Gold   Renewal     2  119 days    
3 2011-10-12         2 NA      Gold   New         3   85 days    
4 2011-02-08         3 NA      Silver Renewal     4  365 days    
5 2012-02-01        72 NA      Gold   Renewal     5  415 days

DATA

df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012", 
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame", 
row.names = c(NA, 
-5L))

df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011", 
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New", 
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 0

Subtract dates across DataFrames

Answers (3)

pandas

data.table

Related Questions