nootskej
nootskej

Reputation: 69

How to extract 1 relevant record with date ranges closest to a record from another df with Python/Pandas?

I'm working with 2 dataframes in Pandas/Python that I loaded in as CSV -

'users' - contains a registration date with an ID (not an index) & more info (multiple of the same user can exist over many years)

    n   ID  age description  days_since_today reg_date
0   17  4EF 33  Empathy      900              01/07/2019
2   1   1AF 23  Purdeys      53               04/02/2016
4   9   8AV 24  Cars         200              20/08/2016

'entry' - contains a date of their entry and a consistent ID to the user ID (not an index)

    nac ID  age description days_since_today entry_date
0   14  4EF 33  something   4                 01/05/2019
2   2   4AA 33  something   320                 01/07/2016
4   1   8SD 33  something   400                 01/01/2019

I need to be able to find instances where an entry was within 1 year of the user's start date, but there could be many users to many entry dates, I need to find the date of each entry closest to the start date.

I have already tried pd.merge_asof() but got a little stuck with it and wasn't sure how to use it in this case. I've made a new column in both dataframes (users, entry) for the dates titled 'days_since_today' which is the date today minus that date tracked as an integer, which I thought would be useful for a comparison here, just not entirely sure how to implement it.

My expected output is to seek instances of entry where users registration date has an entry of less than a year before (but not after) and merge those cells

Please could I have some advice on this one?

Upvotes: 1

Views: 49

Answers (1)

Umar.H
Umar.H

Reputation: 23099

Assuming that users has a unique reg date, we can map date the date onto the entry dataframe then filter by matching years.

next we can apply a groupby and return the lowest index per user by their entry date, within that year.

df1 = users, df2 = entry

df1['reg_date'] = pd.to_datetime(df1['reg_date'])

df2['entry_date'] = pd.to_datetime(df2['entry_date'])

df2['reg_date'] = df2['ID'].map(df1.set_index('ID')['reg_date'])

idx = df2[df2['entry_date'].dt.year == df2['reg_date'].dt.year]\
                           .groupby('ID')['entry_date'].idxmin()


df2.loc[idx]

   nac   ID  age description  days_since_today entry_date   reg_date
0   14  4EF   33   something                 4 2019-01-05 2019-01-07

Upvotes: 1

Related Questions