poppytop
poppytop

Reputation: 341

Join DataFrame based on date which is between dates from another DataFrame

I have two Pandas Data Frames. First one looks like (db1):

id   date_from   date_until   v1   v2
0    2015-06-17  2015-06-19    2    3
0    2015-06-17  2015-07-01    3    4 
1    2015-06-18  2015-06-20    5    6
1    2015-06-19  2015-07-12    5    4 

and the second one (db2), like this:

id      date      v3   v4
0    2015-06-17   4    3
1    2015-06-17   5    6 
1    2015-06-18   6    4  

no i want to compress them so they based on id and date. Date compression should be if db2.date is older or equal than db1.date_from and use the youngest one from db2 witch is still older or equal than db1.data_from. So compressed data should look something like this:

id   date_from   date_until      date      v1   v2   v3   v4
0    2015-06-17  2015-06-19   2015-06-17   2    3    4    3
0    2015-06-17  2015-07-01   2015-06-17   3    4    4    3
1    2015-06-18  2015-06-20   2015-06-18   5    6    6    4
1    2015-06-19  2015-07-12   2015-06-18   5    4    6    4

I can't find any solution except iterating trough all rows, but this is not good enough because my data is bigger 100k of lines. Is there any tool to do this in pandas or should i use something else?

Thank you!

Upvotes: 3

Views: 236

Answers (2)

zipa
zipa

Reputation: 27869

Another way to do it:

import pandas as pd

import sys
if sys.version_info[0] < 3: 
    from StringIO import StringIO
else:
    from io import StringIO

a = StringIO('''id,date_from,date_until,v1,v2
0,2015-06-17,2015-06-19,2,3
0,2015-06-17,2015-07-01,3,4
1,2015-06-18,2015-06-20,5,6
1,2015-06-19,2015-07-12,5,4''')

b = StringIO('''id,date,v3,v4
0,2015-06-17,4,3
1,2015-06-17,5,6 
1,2015-06-18,6,4''')

df1 = pd.read_csv(a, sep=',')
df2 = pd.read_csv(b, sep=',')
df1.merge(df2, how='left', left_on=['id', 'date_from'], right_on=['id', 'date']).ffill()

Upvotes: 2

Scott Boston
Scott Boston

Reputation: 153460

Let's use pd.merge_asof:

pd.merge_asof(df1,df2, left_on=['date_from'], right_on=['date'], by='id')

Output:

   id  date_from  date_until  v1  v2       date  v3  v4
0   0 2015-06-17  2015-06-19   2   3 2015-06-17   4   3
1   0 2015-06-17  2015-07-01   3   4 2015-06-17   4   3
2   1 2015-06-18  2015-06-20   5   6 2015-06-18   6   4
3   1 2015-06-19  2015-07-12   5   4 2015-06-18   6   4

Upvotes: 3

Related Questions