Reputation: 35
I have two datasets:
One contains house energy certificates issued the last 10 years with an ID for the house and the date it was issued. One house could have more certificates issued, as they can renew it.
The other contains all transactions of houses for the last 10 years and the ID (Which is the same id as in the first dataset)
My problem is then find the Energy certificate value of the house on the date it was being sold. I am able to merge the datasets on the house ID, but not quite sure to deal with the date column.
The Energy Certificates has the column with the "DateIssued" and the Transaction data set has the column "OfficialDateSold". The conditions would then be to find the Energy certificate with the right House ID and then with the date closest to the sold date, but not after.
Snippet of the dataframes:
Transactions:
address_id sold_date
0 1223632151 NaN
1 160073875 2013-09-24
2 160073875 2010-06-16
3 160073875 2009-08-05
4 160073875 2006-12-18
... ... ...
2792726 2147477357 2011-11-03
2792727 2147477357 2014-02-26
2792728 2147477579 2017-05-24
2792729 2147479054 2013-02-04
2792730 2147482539 1993-08-10
Energy Certificate
id certificate_number date_issued
0 1785963944 A2012-274656 27.11.2012 10:32:35
1 512265039 A2010-6435 30.06.2010 13:19:18
2 2003824679 A2014-459214 17.06.2014 11:00:47
3 1902877247 A2011-133593 14.10.2011 12:57:08
4 1620713314 A2009-266 25.12.2009 13:18:32
... ... ... ...
307846 753123775 A2019-1078357 30.11.2019 17:23:59
307847 1927124560 A2019-1078363 30.11.2019 20:44:22
307848 1122610963 A2019-1078371 30.11.2019 22:44:45
307849 28668673 A2019-1078373 30.11.2019 22:56:23
307850 1100393780 A2019-1078377 30.11.2019 23:38:42
Want the output
id certificate_number date_issued sold_date
id = address_id
date_issued <= sold_date
But also to find the Certificate closest to the sold_date(the newest before sold) (I know the dates must be in the same format)
I am using Python with Jupyter Notebook.
Upvotes: 0
Views: 63
Reputation: 862681
I think you need merge_asof
, but first is necessary convert columns to datetimes
s by to_datetime
and remove rows with missing values in sold_date
by DataFrame.dropna
:
df1['sold_date'] = pd.to_datetime(df1['sold_date'])
df2['date_issued'] = pd.to_datetime(df2['date_issued'], dayfirst=True)
df1 = df1.dropna(subset=['sold_date'])
df = pd.merge_asof(df2.sort_values('date_issued'),
df1.sort_values('sold_date'),
left_on='date_issued',
right_on='sold_date',
left_by='id',
right_by='address_id')
Upvotes: 1