Reputation: 202
I have 2 dataframes, df1 and df2.
df1 (the lookup)
group_id date value
0 105716 1/30/2019 Soccer
1 105717 1/30/2019 Football
2 105718 1/30/2019 Rest
3 105719 1/30/2019 Soccer
4 105716 1/31/2019 Rest
5 105717 1/31/2019 Rest
6 105718 02/01/2019 Football
7 105719 02/01/2019 Soccer
8 105719 02/02/2019 Tennis
8 105722 02/03/2019 Tennis
df2 (the parent)
GROUP_ID STARTDATE ENDDATE
0 105716 1/30/2019 1/30/2019
1 105717 1/30/2019 1/30/2019
2 105718 1/30/2019 1/30/2019
3 105719 1/30/2019 1/30/2019
4 105716 1/30/2019 1/31/2019
5 105717 1/31/2019 1/31/2019
6 105718 1/31/2019 1/31/2019
7 105719 1/31/2019 1/31/2019
8 105716 1/31/2019 1/31/2019
9 105717 1/31/2019 1/31/2019
10 105718 1/31/2019 1/31/2019
11 105719 1/31/2019 2/1/2019
12 105716 2/1/2019 2/1/2019
13 105717 2/1/2019 2/1/2019
14 105718 2/1/2019 2/1/2019
15 105719 2/1/2019 2/1/2019
16 105716 2/1/2019 2/1/2019
17 105717 2/1/2019 2/1/2019
18 105718 2/1/2019 2/1/2019
19 105719 2/1/2019 2/1/2019
20 105716 2/1/2019 2/2/2019
21 105717 2/2/2019 2/2/2019
22 105718 2/2/2019 2/2/2019
23 105719 2/2/2019 2/2/2019
24 105716 2/2/2019 2/2/2019
25 105717 2/2/2019 2/2/2019
26 105718 2/2/2019 2/2/2019
27 105719 2/2/2019 2/3/2019
28 105716 2/3/2019 2/3/2019
29 105722 2/3/2019 2/3/2019
df2 (the output)
GROUP_ID STARTDATE ENDDATE VALUE
0 105716 1/30/2019 1/30/2019 Soccer
1 105717 1/30/2019 1/30/2019 Football
2 105718 1/30/2019 1/30/2019 Rest
3 105719 1/30/2019 1/30/2019 Soccer
4 105716 1/30/2019 1/31/2019 Rest
5 105717 1/31/2019 1/31/2019 Rest
6 105718 1/31/2019 1/31/2019 None
7 105719 1/31/2019 1/31/2019 None
8 105716 1/31/2019 1/31/2019 Rest
9 105717 1/31/2019 1/31/2019 Rest
10 105718 1/31/2019 1/31/2019 None
11 105719 1/31/2019 2/1/2019 None
12 105716 2/1/2019 2/1/2019 None
13 105717 2/1/2019 2/1/2019 None
14 105718 2/1/2019 2/1/2019 Football
15 105719 2/1/2019 2/1/2019 Soccer
16 105716 2/1/2019 2/1/2019 None
17 105717 2/1/2019 2/1/2019 None
18 105718 2/1/2019 2/1/2019 Football
19 105719 2/1/2019 2/1/2019 Soccer
20 105716 2/1/2019 2/2/2019 None
21 105717 2/2/2019 2/2/2019 None
22 105718 2/2/2019 2/2/2019 None
23 105719 2/2/2019 2/2/2019 Tennis
24 105716 2/2/2019 2/2/2019 None
25 105717 2/2/2019 2/2/2019 None
26 105718 2/2/2019 2/2/2019 None
27 105719 2/2/2019 2/3/2019 None
28 105716 2/3/2019 2/3/2019 None
29 105722 2/3/2019 2/3/2019 Tennis
I am trying to add a RESULT field to df2 and populate it with value from df1 where GROUP_ID = group_id and date is between STARTDATE and ENDDATE while keeping all the rows in df2. Any Nan\Null values will be set to 'None'. I can do this with a loop but it takes a while to go through everything what I tried was numpy.where()
df2['RESULT'] = 'None'
df2.result = np.where(((df1.group_id==df2.GROUP_ID)&((df1.date>=df2.STARTDATE)&(df1.date>=df2.ENDDATE))), df1.value, 'None')
and vectorized approach
df2.result = df1.value[(df1.group_id==df2.GROUP_ID)&((df1.date>=df2.STARTDATE)&(df1.date>=df2.ENDDATE))]
and a merge approach
df_activity = pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')[((pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['STARTDATE'] <= pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id').date)&(pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['ENDDATE'] >= pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['date']))]
The first 2 I tried give me an error
ValueError: Can only compare identically-labeled DataFrame objects
the merge works but it also drops all the rows that don't match. I think I can work around this with another merge but if there is a faster more streamlined way I would like to use it.
Upvotes: 1
Views: 84
Reputation: 202
It seems this question is a duplicate, or at the least closely related to this one here here. From what I gained engaging with the member I implemented this solution.
import pandas as pd
def lookup_value(groupid, date1, date2, lookup):
if len([x for x in lookup[lookup['group_id'].astype(int) == int(groupid)]['date'] if pd.to_datetime(x) >= pd.to_datetime(date1) and pd.to_datetime(x) <= pd.to_datetime(date2)])==0:
return ['None']
else:
match = ((lookup['group_id'].astype(int) == int(groupid)) & (pd.to_datetime(lookup['date'])>=pd.to_datetime(date1)) & (pd.to_datetime(lookup['date'])<=pd.to_datetime(date2)))
value = lookup['value'][match]
return value.values
if __name__ == '__main__':
lookup_cols = ['group_id','date','value']
lookup_data = [['105716','1/30/2019','Soccer'],
['105717','1/30/2019','Football'],
['105718','1/30/2019','Rest'],
['105719','1/30/2019','Soccer'],
['105716','1/31/2019','Rest'],
['105717','1/31/2019','Rest'],
['105718','02/01/2019','Football'],
['105719','02/01/2019','Soccer'],
['105719','02/02/2019','Tennis'],
['105722','02/03/2019','Tennis']]
parent_cols = ['GROUP_ID','STARTDATE','ENDDATE']
parent_data = [['105716','1/30/2019','1/30/2019'],
['105717','1/30/2019','1/30/2019'],
['105718','1/30/2019','1/30/2019'],
['105719','1/30/2019','1/30/2019'],
['105716','1/30/2019','1/31/2019'],
['105717','1/31/2019','1/31/2019'],
['105718','1/31/2019','1/31/2019'],
['105719','1/31/2019','1/31/2019'],
['105716','1/31/2019','1/31/2019'],
['105717','1/31/2019','1/31/2019'],
['105718','1/31/2019','1/31/2019'],
['105719','1/31/2019','2/1/2019'],
['105716','2/1/2019','2/1/2019'],
['105717','2/1/2019','2/1/2019'],
['105718','2/1/2019','2/1/2019'],
['105719','2/1/2019','2/1/2019'],
['105716','2/1/2019','2/1/2019'],
['105717','2/1/2019','2/1/2019'],
['105718','2/1/2019','2/1/2019'],
['105719','2/1/2019','2/1/2019'],
['105716','2/1/2019','2/2/2019'],
['105717','2/2/2019','2/2/2019'],
['105718','2/2/2019','2/2/2019'],
['105719','2/2/2019','2/2/2019'],
['105716','2/2/2019','2/2/2019'],
['105717','2/2/2019','2/2/2019'],
['105718','2/2/2019','2/2/2019'],
['105719','2/2/2019','2/3/2019'],
['105716','2/3/2019','2/3/2019'],
['105722','2/3/2019','2/3/2019']]
parent_df = pd.DataFrame(data=parent_data, columns=parent_cols)
lookup_df = pd.DataFrame(data=lookup_data, columns=lookup_cols)
parent_df['VALUE'] = parent_df.apply(lambda x: lookup_value(x['GROUP_ID'], x['STARTDATE'], x['ENDDATE'], lookup_df)[0], axis=1)
print(parent_df)
Output:
GROUP_ID STARTDATE ENDDATE VALUE
0 105716 1/30/2019 1/30/2019 Soccer
1 105717 1/30/2019 1/30/2019 Football
2 105718 1/30/2019 1/30/2019 Rest
3 105719 1/30/2019 1/30/2019 Soccer
4 105716 1/30/2019 1/31/2019 Soccer
5 105717 1/31/2019 1/31/2019 Rest
6 105718 1/31/2019 1/31/2019 None
7 105719 1/31/2019 1/31/2019 None
8 105716 1/31/2019 1/31/2019 Rest
9 105717 1/31/2019 1/31/2019 Rest
10 105718 1/31/2019 1/31/2019 None
11 105719 1/31/2019 2/1/2019 Soccer
12 105716 2/1/2019 2/1/2019 None
13 105717 2/1/2019 2/1/2019 None
14 105718 2/1/2019 2/1/2019 Football
15 105719 2/1/2019 2/1/2019 Soccer
16 105716 2/1/2019 2/1/2019 None
17 105717 2/1/2019 2/1/2019 None
18 105718 2/1/2019 2/1/2019 Football
19 105719 2/1/2019 2/1/2019 Soccer
20 105716 2/1/2019 2/2/2019 None
21 105717 2/2/2019 2/2/2019 None
22 105718 2/2/2019 2/2/2019 None
23 105719 2/2/2019 2/2/2019 Tennis
24 105716 2/2/2019 2/2/2019 None
25 105717 2/2/2019 2/2/2019 None
26 105718 2/2/2019 2/2/2019 None
27 105719 2/2/2019 2/3/2019 Tennis
28 105716 2/3/2019 2/3/2019 None
29 105722 2/3/2019 2/3/2019 Tennis
Process finished with exit code 0
Upvotes: 0
Reputation: 120409
merge
first on group_id
then query
to filter out your list by dates:
out = (df2.merge(df1, on='group_id')
.query("(startdate <= date) & (date <= enddate)"))
print(out)
# Output:
group_id startdate enddate date value
0 1 2023-01-15 2023-01-20 2023-01-17 100
Input dataframes:
>>> df1
group_id date value
0 1 2023-01-17 100
1 1 2023-01-30 200
>>> df2
group_id startdate enddate
0 1 2023-01-15 2023-01-20
Upvotes: 1