Ben Wilson

Reputation: 2676

Speed up look ahead, behind time-series search in pandas DataFrame

I've got a pandas DataFrame comprising items from an e-commerce site. I'm checking whether or not items still exist in the data frame on either subsequent (look ahead) or previous (behind) days, with a function like this:

#get a unique set of product urls for the next 2 days. If the current row's
#URL is in that set, mark it as 0 (not expiring); otherwise mark it 1 (expiring).
import numpy as np

def is_expiring_product(row):
    # All rows dated after this row but less than 2 days out.
    future = df[(df['date'] > row['date']) &
                (df['date'] < row['date'] + np.timedelta64(2, 'D'))]
    if row.childurl in future.childurl.unique():
        row.is_expiring_product = 0
    else:
        row.is_expiring_product = 1
    return row

The problem is that this takes ages (24 hours+, if it ever completes) to apply to a frame of 1M+ records.

Intuition tells me there must be a more performant method for this...

Maybe creating a separate DataFrame of unique URLs for each day, then seeking on that DataFrame instead of the larger frame? I'm not sure why that would be any faster though...
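
For example, a sketch of that idea (hypothetical, untested) would precompute one set of URLs per day, so each row probes a dict instead of re-filtering the whole frame:

import numpy as np

# One set of child urls per calendar day (assumes day-precision dates).
urls_by_day = df.groupby('date')['childurl'].agg(set).to_dict()

def is_expiring_fast(row):
    # The strict window (d, d + 2 days) only reaches d + 1 at day
    # precision; widen this if the dates carry times of day.
    next_day = row['date'] + np.timedelta64(1, 'D')
    return 0 if row['childurl'] in urls_by_day.get(next_day, ()) else 1

df['is_expiring_product'] = df.apply(is_expiring_fast, axis=1)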

Upvotes: 0

Views: 214

Answers (1)

Valdi_Bo

Reputation: 31011

Code example #1: Compute is_expiring_product for a single date

For the test I used the following DataFrame (date column is of type datetime64[ns]):

        date  url
0 2019-09-10  Xxx
1 2019-09-12  Xxx
2 2019-09-14  Xxx
3 2019-09-11  Yyy
4 2019-09-12  Yyy
5 2019-09-12  Zz1
6 2019-09-12  Zz2
7 2019-09-08  Ttt
8 2019-09-12  Ttt
9 2019-09-15  Ttt
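
For reproducibility, the test frame above can be constructed as follows:

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-09-10', '2019-09-12', '2019-09-14',
                            '2019-09-11', '2019-09-12', '2019-09-12',
                            '2019-09-12', '2019-09-08', '2019-09-12',
                            '2019-09-15']),
    'url': ['Xxx', 'Xxx', 'Xxx', 'Yyy', 'Yyy',
            'Zz1', 'Zz2', 'Ttt', 'Ttt', 'Ttt'],
})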

The starting point is to define two variables, used to cut df by date into bins (Past, Today and Future):

dMin = pd.to_datetime(0)            # "Minimal" date
dMax = pd.to_datetime('2050-12-31') # "Maximal" date
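
Note that pd.to_datetime(0) resolves to the Unix epoch, which serves here simply as a date earlier than any in the data:

print(dMin)   # 1970-01-01 00:00:00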

Assume that we want to compute is_expiring_product only for 2019-09-12, so the first step is to set currDate (it will also be used in the cut):

currDate = pd.to_datetime('2019-09-12')

Then add an is_expiring_product column, initially filled with empty strings:

df = df.assign(is_expiring_product='')

Compute bucket - the result of cutting dates into bins:

bucket = pd.cut(df.date, bins = [dMin, currDate - pd.offsets.Day(1),
    currDate, dMax], labels=['Past', 'Today', 'Future'])
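
As a quick sanity check (pd.cut uses right-closed intervals by default, so currDate - 1 day still lands in Past), the bucket series for the test frame comes out as:

print(bucket.tolist())
# ['Past', 'Today', 'Future', 'Past', 'Today',
#  'Today', 'Today', 'Past', 'Today', 'Future']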

The next step is to compute df_current - a subset of rows for the "current" date:

df_current = df[bucket == 'Today']

To build an Index of the unique urls occurring on the "other" dates, run:

ind_other = pd.Index(df[bucket.isin(['Past', 'Future'])].url.unique())

And the last step is to generate values for is_expiring_product and update this column:

df.is_expiring_product.update((~df_current.url.isin(ind_other)).astype(int))
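
update aligns on the index, so only the rows from df_current are touched. An equivalent, perhaps more explicit assignment (a sketch, not part of the original answer) would be:

df.loc[df_current.index, 'is_expiring_product'] = (
    ~df_current.url.isin(ind_other)).astype(int)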

The result is:

        date  url is_expiring_product
0 2019-09-10  Xxx                    
1 2019-09-12  Xxx                   0
2 2019-09-14  Xxx                    
3 2019-09-11  Yyy                    
4 2019-09-12  Yyy                   0
5 2019-09-12  Zz1                   1
6 2019-09-12  Zz2                   1
7 2019-09-08  Ttt                    
8 2019-09-12  Ttt                   0
9 2019-09-15  Ttt                    

Description of results:

  • Non-empty values are for 2019-09-12.
  • Index == 1, 4 and 8: 0 - These urls (Xxx, Yyy and Ttt) are present on other dates.
  • Index == 5 and 6: 1 - These urls (Zz1 and Zz2) are not present on other dates.

Code example #2: Compute is_expiring_product for all dates

The source DataFrame is the same; start by defining dMin and dMax as before.

Then define the following function to set the expiration status:

def setExpired(grp):
    currDate = grp.name    # Grouping key
    bucket = pd.cut(df.date, bins = [dMin, currDate - pd.offsets.Day(1),
        currDate, dMax], labels=['Past', 'Current', 'Future'])
    df_current = df[bucket == 'Current']
    ind_other = pd.Index(df[bucket.isin(['Past', 'Future'])].url.unique())
    res = (~df_current.url.isin(ind_other)).astype(int)
    return res

And the whole computation boils down to a single instruction, applying this function to each group (one per date):

df['is_expiring_product'] = df.groupby('date').apply(setExpired).droplevel(0)

The additional call to droplevel(0) is required because the result of apply has a MultiIndex with:

  • level 0 - grouping dates,
  • level 1 - keys of each Series returned by calls to setExpired.

So, to save it in df (which uses an "ordinary" index), the top level of the MultiIndex must be dropped.
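
To see why, inspect the intermediate result on the test frame (output abridged; exact formatting varies by pandas version):

print(df.groupby('date').apply(setExpired))
# date
# 2019-09-08  7    0
# 2019-09-10  0    0
# ...
# 2019-09-12  1    0
#             4    0
#             5    1
#             6    1
#             8    0
# ...

The first level is the grouping date and the second is the original row index, which is what droplevel(0) keeps.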

This time the result is:

        date  url  is_expiring_product
0 2019-09-10  Xxx                    0
1 2019-09-12  Xxx                    0
2 2019-09-14  Xxx                    0
3 2019-09-11  Yyy                    0
4 2019-09-12  Yyy                    0
5 2019-09-12  Zz1                    1
6 2019-09-12  Zz2                    1
7 2019-09-08  Ttt                    0
8 2019-09-12  Ttt                    0
9 2019-09-15  Ttt                    0

Yet another, simpler and quicker solution

Note that there is no need to compute is_expiring_product for each row separately.

A much simpler and quicker, totally different solution is to:

  • compute is_expiring_product for each url (without repetitions),
  • then save the result in all rows containing this url.

This solution works as follows:

  • Group rows by url.
  • For each group:
    • take date column,
    • check whether all dates are the same,
    • if they are, return 1 (this url occurs only on a single date),
    • otherwise return 0 (this url occurs on multiple dates).
  • "Expand" this result on each row containing this url.
  • Save the result in is_expiring_product column.

The whole action above can be written as a single instruction:

df['is_expiring_product'] = df.groupby('url').date\
    .transform(lambda grp: 1 if grp.unique().size == 1 else 0)

with the same result as before.
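
If the Python-level lambda is still slow on 1M+ rows, an equivalent formulation (my suggestion, not part of the original answer) pushes the work into the built-in nunique aggregation:

# 1 if the url occurs on exactly one distinct date, else 0.
df['is_expiring_product'] = (
    df.groupby('url').date.transform('nunique').eq(1).astype(int)
)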

Please leave a comment saying how long this instruction took to execute on your DataFrame, and how many rows the frame has.

Upvotes: 1
