BhishanPoudel
BhishanPoudel

Reputation: 17144

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and stuck on it. Here is the problem outline:

import numpy as np
import pandas as pd


df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
          'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
          'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df

   orderid customerid   orderdate
0    10315      ISLAT  1996-09-26
1    10318      ISLAT  1996-10-01
2    10321      ISLAT  1996-10-03
3    10473      ISLAT  1997-03-13
4    10621      ISLAT  1997-08-05
5    10253      HANAR  1996-07-10
6    10541      HANAR  1997-05-19
7    10645      HANAR  1997-08-26

I would like to select all the customers who has ordered items more than once WITHIN 5 DAYS.

For example, here only the customer ordered within 5 days of period and he has done it twice.

I would like to get the output in the following format:

Required Output

customerid  initial_order_id    initial_order_date  nextorderid nextorderdate   daysbetween
ISLAT       10315               1996-09-26          10318       1996-10-01      5
ISLAT       10318               1996-10-01          10321       1996-10-03      2

Upvotes: 2

Views: 418

Answers (4)

Gecko
Gecko

Reputation: 1408

It is a bit tricky because there can be any number of purchase pairs within 5 day windows. It is a good use case for leveraging merge_asof, which allows to do approximate-but-not-exact matching of a dataframe with itself.

Input data

import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
          'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
          'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})

Define a function that computes the pairs of purchases, given data for a customer.

def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df,df, left_index=True, right_index=True,
                                suffixes=('_first', '_second') , allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined

Do the preprocessing and compute the pairs

# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe from last buy to newest (groupby will not change this order)
df2 = df.sort_values(by='orderdate', ascending=False)
# Create an index for joining
df2 = df.set_index('orderdate', drop=False)

# Compute puchases pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)

Result

   orderid_first customerid_first orderdate_first  orderid_second  \
0          10318            ISLAT      1996-10-01         10315.0   
1          10321            ISLAT      1996-10-03         10318.0   

  customerid_second orderdate_second timedelta  
0             ISLAT       1996-09-26    5 days  
1             ISLAT       1996-10-01    2 days  

Upvotes: 2

lsabi
lsabi

Reputation: 4456

It's quite simple. Let's write down the requirements one at the time and try to build upon.

First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.

Second, I assume it does not matter if the customer bought 5 days before or after.

My solution, is to use a simple filter. Note that this solution can also be implemented in a SQL database.

As a condition, we require the user to be the same. We can achieve this as follows:

new_df = df[df["ID"] == df["ID"].shift(1)]

We create a new DataFrame, namely new_df, with all rows such that the xth row has the same user id as the xth - 1 row (i.e. the previous row).

Now, let's search for purchases within the 5 days, by adding the condition to the previous piece of code

new_df = df[df["ID"] == df["ID"].shift(1) & (df["Date"] - df["Date"].shift(1)) <= 5]

This should do the work. I cannot test it write now, so some fixes may be needed. I'll try to test it as soon as I can

Upvotes: 1

Ben.T
Ben.T

Reputation: 29635

you can create the column 'daysbetween' with sort_values and diff. After to get the following order, you can join df with df once groupby per customerid and shift all the data. Finally, query where the number of days in 'daysbetween_next ' is met:

df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = df.join(df.groupby('customerid').shift(-1), 
                   lsuffix='_initial', rsuffix='_next')\
             .drop('daysbetween_initial', axis=1)\
             .query('daysbetween_next <= 5 and daysbetween_next >=0')

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30971

First, to be able to count the difference in days, convert orderdate column to datetime:

df.orderdate = pd.to_datetime(df.orderdate)

Then define the following function:

def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]

And finally apply it:

df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)

Upvotes: 2

Related Questions