Paul
Paul

Reputation: 1897

Optimizing pandas groupby python

I have a dataframe with contracts of products (with a Product_ID). These contracts are opened at a certain date (StartDate) and closed at a certain moment in time (CloseDate). It is also possible the contract is active at this moment in time, and therefore does not have a CloseDate.

There are multiple clients that have contracts, referenced by ID. These customers fill in surveys at certain moments in time, this moment in time is indicated by a date (Key_Date).

What I want to calculate is a several number of features, however for this example I will focus on the number of unique products. I want to know how many unique products a certain customer has, on the moment of filling out the survey.

We have a dataframe df_result, which has the ID of the customer and the date they filled in the survey. In this dataframe we will also append the calculated feature:

import pandas as pd
import numpy as np
np.random.seed(256)
df_result = pd.DataFrame({'ID' : np.random.randint(3, size=(10)),
                      'Key_Date' : pd.date_range(start=pd.datetime(2015, 5, 21), periods=10, freq='m')})
df_result.head()

    ID  Key_Date
0   0   2015-05-31
1   2   2015-06-30
2   1   2015-07-31
3   0   2015-08-31
4   1   2015-09-30

We also have a dataframe with the different contracts/products, named df_products:

np.random.seed(321)
df_products = pd.DataFrame({'ID' : np.random.randint(5, size=(10)),
                        'Product_ID' : np.random.randint(low = 101, high = 104, size=10),
                      'StartDate' : pd.date_range(start=pd.datetime(2015, 3, 1), periods=10, freq='m'),
                       'CloseDate' : pd.date_range(start=pd.datetime(2016, 1, 1), periods=10, freq='m')})
df_products.head()

    CloseDate   StartDate   ID  Product_ID
0   2016-01-31  2015-03-31  4   102
1   2016-02-29  2015-04-30  2   101
2   2016-03-31  2015-05-31  4   102
3   2016-04-30  2015-06-30  1   102
4   2016-05-31  2015-07-31  0   103

I made a function to count the unique products of the customer who filled in the survey, where the contract was still active on the time of filling out, key_date (so the start date of the contract (StartDate) is before this date, and the end date (CloseDate) is after this date). I also want to be able to give a range before the filling in date, so all unique products that have been active in the past year for example. So even the closed contracts of 11 months ago will be included. I do this by giving an extra parameter timeperiod which I subtract of the filling in date (making a new date: low_date). Then, the CloseDate will have to be later than low_date, instead of the key_date.

def unique_products(df,timeperiod,ID,key_date):
    low_date = key_date - relativedelta(months=timeperiod)
    data = df.loc[(df['StartDate'] <= key_date) & 
                  (df['CloseDate'] >= low_date) &
              (df['ID'] == ID)].groupby(['ID'], as_index = False)['Product_ID'].nunique().reset_index()
    if 'Product_ID' in list(data):
        try:
            return float(data['Product_ID'])
        except:
            return np.nan

After this I append these values in a new column named unique_products in df_result:

df_result['unique_products'] = df_result.apply(lambda row: unique_products(df_products, 3, row['ID'], row['Key_Date']), axis=1)
df_result.head()


    ID  Key_Date    unique_products
0   0   2015-05-31  NaN
1   2   2015-06-30  1.0
2   1   2015-07-31  1.0
3   0   2015-08-31  1.0
4   1   2015-09-30  2.0

However when applying this to my whole dateset, it becomes quite slow due to the fact that each surveyrow must be evaluated since they have different times. Is there any way to improve this?

Thanks for any input :)

Upvotes: 1

Views: 231

Answers (2)

siddharth iyer
siddharth iyer

Reputation: 169

You need to use merge.

merged = pd.merged(df_products,df_results,how='left',on='ID')

Now merged will have all columns of df_products along with 'Key Date', if it is null then the person hasn't filled the survey.

filled_survey = merged.loc[~(merged['Key Date'].isnull())]

Now you can find the timedelta by subtracting the relevant dates and filter accordingly.

Upvotes: 1

df_result['low_date'] = df_result['key_date'] - relativedelta(months=timeperiod) #creating low_date column
df_result2 = pandas.merge(df_result,df_products,how = "outer",on = "ID") #Join both the tables
df_result2 = df_result2[(df_result2['StartDate'] <= df_result2['key_date']) & (df_result2['CloseDate'] >= df_result2['low_date'])] # Filter conditions
df_result2 = df_result2.groupby(['ID','Key_Date'])['Product_ID'].nunique().reset_index()

Try this out once using a cross join instead of a kind of loop you are using.

Upvotes: 0

Related Questions