Reputation: 519
I just start learning Python and R, so any advice using either of them would be much appreciated.
My data are stored in two dataframes. One is sales data, for each consumer, we can see the date when he purchases something. It is possible the same person purchases more than once:
Date Person ID Product
01-05-2012 1 cereal
01-05-2012 2 apple
02-08-2012 3 beef
03-22-2013 72 pot
07-19-2012 1 cake
The second dataframe has membership data, which tell us when did a person enrolled in the program:
Date Person ID Type Status
06-11-2008 1 Gold New
10-12-2011 2 Gold New
02-08-2011 3 Silver Renewal
02-01-2012 72 Gold Renewal
03-22-2012 1 Gold Renewal
What I want to do is, for the same person, how long does it take before a person purchases something before he enrolls in a program.
For example, person 1 got a new membership on 06-11-2008 and purchased cereal on 01-05-2012. I would like to calculate how many days there are between these two dates.
However, these information are stored in separate dataframes. I don't think they can be append or merged into one dataframe, because one person can have more than one observations in one or both of the dataframes.
What I am thinking is, extract all the dates from sales data, and extract all the dates from the license data. Then merge these two new dataframes into a new dataframe. This will give me:
License Date Person ID Sales Date
06-11-2008 1 01-05-2012
10-12-2011 2 01-05-2012
02-08-2011 3 02-08-2011
02-01-2012 72 03-22-2013
06-11-2008 1 07-19-2012
03-22-2012 1 01-05-2012
03-22-2012 1 07-19-2012
But the problem here is, if a person has two license dates (ex. one new and one renewal), then merge the data will give 2*(sales dates)... but I only want the sales dates for a license that is valid..
For example, person 1 used license 06-11-2008 to buy cereal on 01-05-2012, and used license 03-22-2012 to buy on 07-19-2012. But merging the dataframes will give me 4 records rather than the 2 I want...
The result I would want is the time to purchase for each sale, after he gets the license which he used for that purchase:
License Date Person ID Sales Date TimeToPurchase
06-11-2008 1 01-05-2012 ...
10-12-2011 2 01-05-2012 ...
02-08-2011 3 02-08-2011 ...
02-01-2012 72 03-22-2013 ...
03-22-2012 1 07-19-2012 ...
Is there a better way you suggest I can do?
Thank you very much for the help!
Upvotes: 3
Views: 152
Reputation: 33417
Here is a solution using R and library(data.table)
assuming you are only interested in the latest time to purchase:
Edit: after question was updated
library(data.table)
purchaseDT <- data.table(stringsAsFactors=FALSE,
Date = c("01-05-2009", "01-05-2012", "02-08-2012", "03-22-2013"),
PersonID = c(1, 2, 1, 72),
Product = c("cereal", "apple", "beef", "pot")
)
programDT <- data.table(stringsAsFactors=FALSE,
Date = c("06-11-2008", "10-12-2011", "02-08-2011", "02-01-2012"),
PersonID = c(1, 2, 1, 72),
Type = c("Gold", "Gold", "Silver", "Gold"),
Status = c("New", "New", "Renewal", "Renewal")
)
purchaseDT[, PurchaseDate := as.Date(Date, format="%m-%d-%Y")]
programDT[, LicenseDate := as.Date(Date, format="%m-%d-%Y")]
purchaseDT[, Date := NULL]
programDT[, Date := NULL]
mergedDT <- purchaseDT[programDT, on="PersonID"]
mergedDT[, TimeToPurchase := PurchaseDate-LicenseDate]
mergedDT <- mergedDT[TimeToPurchase > 0]
resultDT <- mergedDT[, .(TimeToPurchase = min(TimeToPurchase)), by = c("LicenseDate", "PersonID")]
resultDT[, PurchaseDate := LicenseDate+TimeToPurchase]
print(resultDT)
Result:
LicenseDate PersonID TimeToPurchase PurchaseDate
1: 2008-06-11 1 208 days 2009-01-05
2: 2011-10-12 2 85 days 2012-01-05
3: 2011-02-08 1 365 days 2012-02-08
4: 2012-02-01 72 415 days 2013-03-22
Upvotes: 1
Reputation: 1312
First your dates need to be saved as datetime, which you can accomplish like this:
sales['Date'] = pd.to_datetime(sales['Date'])
memberships['Date'] = pd.to_datetime(memberships['Date'])
Then you merge them by Person ID
and arrive to the format that has duplicates.
m = sales.merge(memberships, left_on='Person ID', right_on='Person ID',
suffixes=('_sales', '_memberships'))
m
Date_sales Person ID Product Date_memberships Type Status
0 2012-01-05 1 cereal 2008-06-11 Gold New
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal
2 2012-07-19 1 cake 2008-06-11 Gold New
3 2012-07-19 1 cake 2012-03-22 Gold Renewal
4 2012-01-05 2 apple 2011-10-12 Gold New
5 2012-02-08 3 beef 2011-02-08 Silver Renewal
6 2013-03-22 72 pot 2012-02-01 Gold Renewal
Now you can calculate the elapsed days between the sales and the membership dates like this:
m['TimeToPurchase'] = (m['Date_sales'] - m['Date_memberships']).dt.days
m
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
0 2012-01-05 1 cereal 2008-06-11 Gold New 1303
1 2012-01-05 1 cereal 2012-03-22 Gold Renewal -77
2 2012-07-19 1 cake 2008-06-11 Gold New 1499
3 2012-07-19 1 cake 2012-03-22 Gold Renewal 119
4 2012-01-05 2 apple 2011-10-12 Gold New 85
5 2012-02-08 3 beef 2011-02-08 Silver Renewal 365
6 2013-03-22 72 pot 2012-02-01 Gold Renewal 415
From here you can first eliminate the negatives and then get the minimum TimeToPurchase
for each Person ID and Date sales.
m = m[m['TimeToPurchase'] >= 0]
keep = m.groupby(['Person ID', 'Date_sales'], as_index=False)['TimeToPurchase'].min()
keep
Person ID Date_sales TimeToPurchase
1 2012-01-05 1303
1 2012-07-19 119
2 2012-01-05 85
3 2012-02-08 365
72 2013-03-22 415
These are the records that you want to keep in your merged table, which you can filter with an inner join:
result = m.merge(keep,
left_on=['Person ID', 'Date_sales', 'TimeToPurchase'],
right_on=['Person ID', 'Date_sales', 'TimeToPurchase'])
result
Date_sales Person ID Product Date_memberships Type Status TimeToPurchase
2012-01-05 1 cereal 2008-06-11 Gold New 1303
2012-07-19 1 cake 2012-03-22 Gold Renewal 119
2012-01-05 2 apple 2011-10-12 Gold New 85
2012-02-08 3 beef 2011-02-08 Silver Renewal 365
2013-03-22 72 pot 2012-02-01 Gold Renewal 415
Same logic as above, so I'll just paste the code.
# Date types
sales[, Date := as.Date(Date, format = "%m-%d-%Y")]
memberships[, Date := as.Date(Date, format = "%m-%d-%Y")]
# Merge
m <- memberships[sales, on = "Person ID"]
# Calculate elapsed days
m[, TimeToPurchase := as.numeric(m$i.Date - m$Date)]
# Eliminate negatives
m <- m[TimeToPurchase >= 0]
# Calculate records to keep
keep <- m[, .(TimeToPurchase = min(TimeToPurchase)), by = .(`Person ID`, i.Date)]
# Filter result
result <- m[keep, on = c("Person ID", "i.Date", "TimeToPurchase")]
result
Date Person ID Type Status i.Date Product TimeToPurchase
1: 2008-06-11 1 Gold New 2012-01-05 cereal 1303
2: 2011-10-12 2 Gold New 2012-01-05 apple 85
3: 2011-02-08 3 Silver Renewal 2012-02-08 beef 365
4: 2012-02-01 72 Gold Renewal 2013-03-22 pot 415
5: 2012-03-22 1 Gold Renewal 2012-07-19 cake 119
Upvotes: 2
Reputation: 23574
Here is one idea for you. First, I merged the two data sets using Person_ID
and Date
. In this example, I needed to create a date object (i.e., Date) in the first mutate()
. I sorted the data by Person_ID
and Date
. Then, I created a new grouping variable. What I did was to identify rows where Status
is either "New" or "Renewal". This means that I identified when a license became valid for the first time. That row becomes the first row for each license. For each group
, I chose the first two rows. The data is arranged by Person_ID
and Date
, so the 2nd row should be the one that a customer bought something with the valid license for the first time. Finally, I calculated the interval (i.e., time2purchase
) using Date
.
full_join(df1, df2, by = c("Person_ID", "Date")) %>%
mutate(Date = as.Date(Date, format = "%m-%d-%Y")) %>%
arrange(Person_ID, Date) %>%
mutate(group = findInterval(x = 1:n(), vec = grep(Status, pattern = "New|Renewal"))) %>%
group_by(group) %>%
slice(1:2) %>%
summarize(time2purchase = Date[2]-Date[1])
group time2purchase
<int> <time>
1 1 1303 days
2 2 119 days
3 3 85 days
4 4 365 days
5 5 415 days
To make things clearer, I leave the results below, which you can generate
using mutate() instead of summarize().
Date Person_ID Product Type Status group time2purchase
<date> <int> <chr> <chr> <chr> <int> <time>
1 2008-06-11 1 NA Gold New 1 1303 days
2 2012-03-22 1 NA Gold Renewal 2 119 days
3 2011-10-12 2 NA Gold New 3 85 days
4 2011-02-08 3 NA Silver Renewal 4 365 days
5 2012-02-01 72 NA Gold Renewal 5 415 days
DATA
df1 <-structure(list(Date = c("01-05-2012", "01-05-2012", "02-08-2012",
"03-22-2013", "07-19-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Product = c("cereal", "apple", "beef", "pot", "cake")), class = "data.frame",
row.names = c(NA,
-5L))
df2 <- structure(list(Date = c("06-11-2008", "10-12-2011", "02-08-2011",
"02-01-2012", "03-22-2012"), Person_ID = c(1L, 2L, 3L, 72L, 1L
), Type = c("Gold", "Gold", "Silver", "Gold", "Gold"), Status = c("New",
"New", "Renewal", "Renewal", "Renewal")), class = "data.frame", row.names = c(NA,
-5L))
Upvotes: 0