MrKingsley
MrKingsley

Reputation: 202

Populate a column in a df using another df as a lookup table

I have 2 dataframes, df1 and df2.

df1 (the lookup)    
group_id       date     value
0    105716  1/30/2019    Soccer
1    105717  1/30/2019  Football
2    105718  1/30/2019      Rest
3    105719  1/30/2019    Soccer
4    105716  1/31/2019      Rest
5    105717  1/31/2019      Rest
6    105718  02/01/2019  Football
7    105719  02/01/2019    Soccer
8    105719  02/02/2019    Tennis
8    105722  02/03/2019    Tennis

df2 (the parent)
    GROUP_ID  STARTDATE    ENDDATE
0     105716  1/30/2019  1/30/2019
1     105717  1/30/2019  1/30/2019
2     105718  1/30/2019  1/30/2019
3     105719  1/30/2019  1/30/2019
4     105716  1/30/2019  1/31/2019
5     105717  1/31/2019  1/31/2019
6     105718  1/31/2019  1/31/2019
7     105719  1/31/2019  1/31/2019
8     105716  1/31/2019  1/31/2019
9     105717  1/31/2019  1/31/2019
10    105718  1/31/2019  1/31/2019
11    105719  1/31/2019   2/1/2019
12    105716   2/1/2019   2/1/2019
13    105717   2/1/2019   2/1/2019
14    105718   2/1/2019   2/1/2019
15    105719   2/1/2019   2/1/2019
16    105716   2/1/2019   2/1/2019
17    105717   2/1/2019   2/1/2019
18    105718   2/1/2019   2/1/2019
19    105719   2/1/2019   2/1/2019
20    105716   2/1/2019   2/2/2019
21    105717   2/2/2019   2/2/2019
22    105718   2/2/2019   2/2/2019
23    105719   2/2/2019   2/2/2019
24    105716   2/2/2019   2/2/2019
25    105717   2/2/2019   2/2/2019
26    105718   2/2/2019   2/2/2019
27    105719   2/2/2019   2/3/2019
28    105716   2/3/2019   2/3/2019
29    105722   2/3/2019   2/3/2019

    df2 (the output)
    GROUP_ID  STARTDATE    ENDDATE    VALUE
0     105716  1/30/2019  1/30/2019    Soccer
1     105717  1/30/2019  1/30/2019    Football
2     105718  1/30/2019  1/30/2019    Rest
3     105719  1/30/2019  1/30/2019    Soccer
4     105716  1/30/2019  1/31/2019    Rest
5     105717  1/31/2019  1/31/2019    Rest
6     105718  1/31/2019  1/31/2019    None
7     105719  1/31/2019  1/31/2019    None
8     105716  1/31/2019  1/31/2019    Rest
9     105717  1/31/2019  1/31/2019    Rest
10    105718  1/31/2019  1/31/2019    None
11    105719  1/31/2019   2/1/2019    None
12    105716   2/1/2019   2/1/2019    None
13    105717   2/1/2019   2/1/2019    None
14    105718   2/1/2019   2/1/2019    Football
15    105719   2/1/2019   2/1/2019    Soccer
16    105716   2/1/2019   2/1/2019    None
17    105717   2/1/2019   2/1/2019    None
18    105718   2/1/2019   2/1/2019    Football
19    105719   2/1/2019   2/1/2019    Soccer
20    105716   2/1/2019   2/2/2019    None
21    105717   2/2/2019   2/2/2019    None
22    105718   2/2/2019   2/2/2019    None
23    105719   2/2/2019   2/2/2019    Tennis
24    105716   2/2/2019   2/2/2019    None
25    105717   2/2/2019   2/2/2019    None
26    105718   2/2/2019   2/2/2019    None
27    105719   2/2/2019   2/3/2019    None
28    105716   2/3/2019   2/3/2019    None
29    105722   2/3/2019   2/3/2019    Tennis

I am trying to add a RESULT field to df2 and populate it with value from df1 where GROUP_ID = group_id and date is between STARTDATE and ENDDATE while keeping all the rows in df2. Any Nan\Null values will be set to 'None'. I can do this with a loop but it takes a while to go through everything what I tried was numpy.where()

df2['RESULT'] = 'None'
df2.result = np.where(((df1.group_id==df2.GROUP_ID)&((df1.date>=df2.STARTDATE)&(df1.date>=df2.ENDDATE))), df1.value, 'None')

and vectorized approach

df2.result = df1.value[(df1.group_id==df2.GROUP_ID)&((df1.date>=df2.STARTDATE)&(df1.date>=df2.ENDDATE))]

and a merge approach

df_activity = pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')[((pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['STARTDATE'] <= pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id').date)&(pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['ENDDATE'] >= pd.merge(df2, df1, left_on='GROUP_ID', right_on='group_id')['date']))]

The first 2 I tried give me an error

ValueError: Can only compare identically-labeled DataFrame objects

the merge works but it also drops all the rows that don't match. I think I can work around this with another merge but if there is a faster more streamlined way I would like to use it.

Upvotes: 1

Views: 84

Answers (2)

MrKingsley
MrKingsley

Reputation: 202

It seems this question is a duplicate, or at the least closely related to this one here here. From what I gained engaging with the member I implemented this solution.

import pandas as pd

def lookup_value(groupid, date1, date2, lookup):
    if len([x for x in lookup[lookup['group_id'].astype(int) == int(groupid)]['date'] if pd.to_datetime(x) >= pd.to_datetime(date1) and pd.to_datetime(x) <= pd.to_datetime(date2)])==0:
        return ['None']
    else:
        match = ((lookup['group_id'].astype(int) == int(groupid)) & (pd.to_datetime(lookup['date'])>=pd.to_datetime(date1)) & (pd.to_datetime(lookup['date'])<=pd.to_datetime(date2)))
        value = lookup['value'][match]
        return value.values

if __name__ == '__main__':
    lookup_cols = ['group_id','date','value']
    lookup_data = [['105716','1/30/2019','Soccer'],
    ['105717','1/30/2019','Football'],
    ['105718','1/30/2019','Rest'],
    ['105719','1/30/2019','Soccer'],
    ['105716','1/31/2019','Rest'],
    ['105717','1/31/2019','Rest'],
    ['105718','02/01/2019','Football'],
    ['105719','02/01/2019','Soccer'],
    ['105719','02/02/2019','Tennis'],
    ['105722','02/03/2019','Tennis']]

    parent_cols = ['GROUP_ID','STARTDATE','ENDDATE']
    parent_data = [['105716','1/30/2019','1/30/2019'],
    ['105717','1/30/2019','1/30/2019'],
    ['105718','1/30/2019','1/30/2019'],
    ['105719','1/30/2019','1/30/2019'],
    ['105716','1/30/2019','1/31/2019'],
    ['105717','1/31/2019','1/31/2019'],
    ['105718','1/31/2019','1/31/2019'],
    ['105719','1/31/2019','1/31/2019'],
    ['105716','1/31/2019','1/31/2019'],
    ['105717','1/31/2019','1/31/2019'],
    ['105718','1/31/2019','1/31/2019'],
    ['105719','1/31/2019','2/1/2019'],
    ['105716','2/1/2019','2/1/2019'],
    ['105717','2/1/2019','2/1/2019'],
    ['105718','2/1/2019','2/1/2019'],
    ['105719','2/1/2019','2/1/2019'],
    ['105716','2/1/2019','2/1/2019'],
    ['105717','2/1/2019','2/1/2019'],
    ['105718','2/1/2019','2/1/2019'],
    ['105719','2/1/2019','2/1/2019'],
    ['105716','2/1/2019','2/2/2019'],
    ['105717','2/2/2019','2/2/2019'],
    ['105718','2/2/2019','2/2/2019'],
    ['105719','2/2/2019','2/2/2019'],
    ['105716','2/2/2019','2/2/2019'],
    ['105717','2/2/2019','2/2/2019'],
    ['105718','2/2/2019','2/2/2019'],
    ['105719','2/2/2019','2/3/2019'],
    ['105716','2/3/2019','2/3/2019'],
    ['105722','2/3/2019','2/3/2019']]

    parent_df = pd.DataFrame(data=parent_data, columns=parent_cols)
    lookup_df = pd.DataFrame(data=lookup_data, columns=lookup_cols)

    parent_df['VALUE'] = parent_df.apply(lambda x: lookup_value(x['GROUP_ID'], x['STARTDATE'], x['ENDDATE'], lookup_df)[0], axis=1)
    print(parent_df)

Output:

GROUP_ID  STARTDATE    ENDDATE     VALUE
0    105716  1/30/2019  1/30/2019    Soccer
1    105717  1/30/2019  1/30/2019  Football
2    105718  1/30/2019  1/30/2019      Rest
3    105719  1/30/2019  1/30/2019    Soccer
4    105716  1/30/2019  1/31/2019    Soccer
5    105717  1/31/2019  1/31/2019      Rest
6    105718  1/31/2019  1/31/2019      None
7    105719  1/31/2019  1/31/2019      None
8    105716  1/31/2019  1/31/2019      Rest
9    105717  1/31/2019  1/31/2019      Rest
10   105718  1/31/2019  1/31/2019      None
11   105719  1/31/2019   2/1/2019    Soccer
12   105716   2/1/2019   2/1/2019      None
13   105717   2/1/2019   2/1/2019      None
14   105718   2/1/2019   2/1/2019  Football
15   105719   2/1/2019   2/1/2019    Soccer
16   105716   2/1/2019   2/1/2019      None
17   105717   2/1/2019   2/1/2019      None
18   105718   2/1/2019   2/1/2019  Football
19   105719   2/1/2019   2/1/2019    Soccer
20   105716   2/1/2019   2/2/2019      None
21   105717   2/2/2019   2/2/2019      None
22   105718   2/2/2019   2/2/2019      None
23   105719   2/2/2019   2/2/2019    Tennis
24   105716   2/2/2019   2/2/2019      None
25   105717   2/2/2019   2/2/2019      None
26   105718   2/2/2019   2/2/2019      None
27   105719   2/2/2019   2/3/2019    Tennis
28   105716   2/3/2019   2/3/2019      None
29   105722   2/3/2019   2/3/2019    Tennis

Process finished with exit code 0

Upvotes: 0

Corralien
Corralien

Reputation: 120409

merge first on group_id then query to filter out your list by dates:

out = (df2.merge(df1, on='group_id')
          .query("(startdate <= date) & (date <= enddate)"))
print(out)

# Output:
   group_id  startdate    enddate       date  value
0         1 2023-01-15 2023-01-20 2023-01-17    100

Input dataframes:

>>> df1
   group_id       date  value
0         1 2023-01-17    100
1         1 2023-01-30    200

>>> df2
   group_id  startdate    enddate
0         1 2023-01-15 2023-01-20

Upvotes: 1

Related Questions