DannyTG
DannyTG

Reputation: 35

Merging two DataFrames (Datasets) on a specific ID column but with Date condition

I have two datasets:

  1. One contains house energy certificates issued the last 10 years with an ID for the house and the date it was issued. One house could have more certificates issued, as they can renew it.

  2. The other contains all transactions of houses for the last 10 years and the ID (Which is the same id as in the first dataset)

My problem is then find the Energy certificate value of the house on the date it was being sold. I am able to merge the datasets on the house ID, but not quite sure to deal with the date column.

The Energy Certificates has the column with the "DateIssued" and the Transaction data set has the column "OfficialDateSold". The conditions would then be to find the Energy certificate with the right House ID and then with the date closest to the sold date, but not after.

Snippet of the dataframes:

Transactions:

         address_id sold_date
0        1223632151           NaN
1         160073875    2013-09-24
2         160073875    2010-06-16
3         160073875    2009-08-05
4         160073875    2006-12-18
...             ...           ...
2792726  2147477357    2011-11-03
2792727  2147477357    2014-02-26
2792728  2147477579    2017-05-24
2792729  2147479054    2013-02-04
2792730  2147482539    1993-08-10

Energy Certificate

                id   certificate_number      date_issued
0       1785963944   A2012-274656  27.11.2012 10:32:35
1        512265039     A2010-6435  30.06.2010 13:19:18
2       2003824679   A2014-459214  17.06.2014 11:00:47
3       1902877247   A2011-133593  14.10.2011 12:57:08
4       1620713314      A2009-266  25.12.2009 13:18:32
...            ...            ...                  ...
307846   753123775  A2019-1078357  30.11.2019 17:23:59
307847  1927124560  A2019-1078363  30.11.2019 20:44:22
307848  1122610963  A2019-1078371  30.11.2019 22:44:45
307849    28668673  A2019-1078373  30.11.2019 22:56:23
307850  1100393780  A2019-1078377  30.11.2019 23:38:42

Want the output

 id                certificate_number      date_issued    sold_date
 id = address_id

 date_issued <= sold_date 

But also to find the Certificate closest to the sold_date(the newest before sold) (I know the dates must be in the same format)

I am using Python with Jupyter Notebook.

Upvotes: 0

Views: 63

Answers (1)

jezrael
jezrael

Reputation: 862681

I think you need merge_asof, but first is necessary convert columns to datetimess by to_datetime and remove rows with missing values in sold_date by DataFrame.dropna:

df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)
df1 = df1.dropna(subset=['sold_date'])


df = pd.merge_asof(df2.sort_values('date_issued'), 
                   df1.sort_values('sold_date'), 
                   left_on='date_issued',
                   right_on='sold_date',
                   left_by='id',
                   right_by='address_id')

Upvotes: 1

Related Questions