R overflow
R overflow

Reputation: 1352

Comparing Two DataFrames and Loop through them (to test a condition)

I am trying to 'join' two DataFrames based on a condition.

Condition

if df1.Year == df2.Year & 
df1.Date >= df2.BeginDate or df1.Date <= df2.EndDate &
df1.ID == df2.ID 
#if the condition is True, I would love to add an extra column (binary) to df1, something like
#df1.condition = Yes or No.

My data looks like this:

df1: 

Year     Week     ID   Date
2020      1      123   2020-01-01 00:00:00
2020      1      345   2020-01-01 00:00:00
2020      2      123   2020-01-07 00:00:00
2020      1      123   2020-01-01 00:00:00


df2: 

Year    BeginDate               EndDate                 ID
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    123
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    123
2020    2020-01-01  00:00:00    2020-01-02  00:00:00    978
2020    2020-09-21  00:00:00    2020-01-02  00:00:00    978


end_df: #Expected output 

Year     Week     ID   Condition
2020      1      123     True  #Year is matching, week1 is between the dates, ID is matching too 
2019      1      345     False #Year is not matching
2020      2      187     False # ID is not matching 
2020      1      123     True  # Same as first row. 

I thought to solve this by looping over two DataFrames:

for row in df1.iterrrows(): 
    for row2 in df2.iterrows(): 
         if row['Year'] == row2['Year2']:
              if row['ID] == row2['ID']: 
                  .....
                  .....
                   row['Condition'] = True 
         else: 
            row['Condition'] = False 

However... this is leading to error after error.

Really looking forward how you guys will tackle this problem. Many thanks in advance!

UPDATE 1

I created a loop. However, this loop is taking ages (and I am not sure how to add the value to a new column).

Note, in df1 I created a 'Date' column (in the same format as the begin & enddate from df2).

Key now: How can I add the True value (in the end of the loop..) to my df1 (in an extra column)?

for index, row in df1.interrows(): 
      row['Year'] = str(row['Year'])

      for index1, row1 in df2.iterrows():
          row1['Year'] = str(row1['Year'])


          if row['Year'] == row1['Year']:
                  row['ID'] = str(row['ID']) 
                  row1['ID'] = str(row1['ID']) 


                  if row['ID] == row1['ID']: 

                        if row['Date'] >= row1['BeginDate'] and row['Date'] <= row1['Enddate']:
                              print("I would like to add this YES to df1 in an extra column")

Edit 2

Trying @davidbilla solution: It looks like the 'condition' column is not doing well. As you can see, it match even while df1.Year != df2.Year. Note that df2 is sorted based on ID (so all the same unique numbers should be there

It looks like the 'condition' column is not doing well. As you can see, it match even while df1.Year != df2.Year. Note that df2 is sorted based on ID (so all the same unique numbers should be there

Upvotes: 2

Views: 6284

Answers (2)

Andy L.
Andy L.

Reputation: 25239

The 2nd row of the desired output has Year as 2019, so I assume the 2nd row of df1.Year is also 2019 instead of 2020

If I understand correctly, you need to merge and filter-out Date outside of the BeginDate and EndDate range. First, there are duplicates and invalid date ranges in df2. We need to drop duplicates and invalid ranges before merge. Invalid date ranges are ranges where BeginDate >= EndDate which is index 3 of df2.

#convert all date columns of both `df1` and `df2` to datetime dtype
df1['Date'] = pd.to_datetime(df1['Date'])
df2[['BeginDate', 'EndDate']] = df2[['BeginDate', 'EndDate']].apply(pd.to_datetime)

#left-merge on `Year`, `ID` and using `eval` to compute 
#columns `Condition` where `Date` is between `BeginDate` and `EndDate`. 
#Finally assign back to `df1`
df1['Condition'] = (df1.merge(df2.loc[df2.BeginDate < df2.EndDate].drop_duplicates(), 
                                on=['Year','ID'], how='left')
                       .eval('Condition= BeginDate <= Date <= EndDate')['Condition'])

Out[614]:
   Year  Week   ID       Date  Condition
0  2020     1  123 2020-01-01       True
1  2019     1  345 2020-01-01      False
2  2020     2  123 2020-01-07      False
3  2020     1  123 2020-01-01       True

Upvotes: 2

davidbilla
davidbilla

Reputation: 2222

I guess you are expecting something like this - if you are trying to match the dataframes row wise (i.e compare row1 of df1 with row1 of df2):

df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID'])&((df1['Date']>=df2['BeginDate'])or(df1['Date']<=df2['EndDate'])), True, False)

np.where takes the conditions as the first parameter, the second parameter will the be the value if the condition pass, the 3rd parameter is the value if the condition fail.

EDIT 1: Based on your sample dataset

df1 = pd.DataFrame([[2020,1,123],[2020,1,345],[2020,2,123],[2020,1,123]],
                   columns=['Year','Week','ID'])
df2 = pd.DataFrame([[2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',123],
                    [2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',123],
                    [2020,'2020-01-01  00:00:00','2020-01-02  00:00:00',978],
                    [2020,'2020-09-21  00:00:00','2020-01-02  00:00:00',978]],
                   columns=['Year','BeginDate','EndDate','ID'])
df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['condition'] = np.where((df1['Year']==df2['Year'])&(df1['ID']==df2['ID']),True, False)
# &((df1['Date']>=df2['BeginDate'])or(df1['Date']<=df2['EndDate'])) - removed this condition as the df has no Date field
print(df1)

Output:

   Year  Date   ID  condition
0  2020     1  123       True
1  2020     1  345      False
2  2020     2  123      False
3  2020     1  123      False

EDIT 2: To compare one row in df1 with all rows in df2

df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))

This takes df1['Year'] and compares it against all values of df2['Year'].

Based on the sample dataset:

df1:

   Year       Date   ID  
0  2020 2020-01-01  123  
1  2020 2020-01-01  345  
2  2020 2020-10-01  123  
3  2020 2020-11-13  123  

df2:

   Year  BeginDate    EndDate   ID
0  2020 2020-01-01 2020-02-01  123
1  2020 2020-01-01 2020-01-02  123
2  2020 2020-03-01 2020-05-01  978
3  2020 2020-09-21 2020-10-01  978

Code change:

date_range = list(zip(df2['BeginDate'],df2['EndDate']))

def check_date(date):
    for (s,e) in date_range:
        if date>=s and date<=e:
            return True
    return False

df1['condition'] = (df1['Year'].isin(df2['Year']))&(df1['ID'].isin(df2['ID']))
df1['date_compare'] = df1['Date'].apply(lambda x: check_date(x)) # you can directly store this in df1['condition']. I just wanted to print the values so have used a new field
df1['condition'] = (df1['condition']==True)&(df1['date_compare']==True)

Output:

   Year       Date   ID  condition  date_compare
0  2020 2020-01-01  123       True          True    # Year match, ID match and Date is within the range of df2 row 1
1  2020 2020-01-01  345      False          True    # Year match, ID no match
2  2020 2020-10-01  123       True          True    # Year match, ID match, Date is within range of df2 row 4
3  2020 2020-11-13  123      False         False    # Year match, ID match, but Date is not in range of any row in df2

EDIT 3: Based on updated question (Earlier I thought it was ok if the 3 values year, id and date match df2 in any of the rows not on the same row). I think I got better understanding of your requirement now.

df2['BeginDate'] = pd.to_datetime(df2['BeginDate'])
df2['EndDate'] = pd.to_datetime(df2['EndDate'])
df1['Date'] = pd.to_datetime(df1['Date'])

df1['condition'] = False
for idx1, row1 in df1.iterrows():
    match = False
    for idx2, row2 in df2.iterrows():
        if (row1['Year']==row2['Year']) & \
                (row1['ID']==row2['ID']) & \
                (row1['Date']>=row2['BeginDate']) & \
                (row1['Date']<=row2['EndDate']):
            match = True
    df1.at[idx1, 'condition'] = match

Output - Set 1:

DF1:

   Year       Date   ID
0  2020 2020-01-01  123
1  2020 2020-01-01  123
2  2020 2020-01-01  345
3  2020 2020-01-10  123
4  2020 2020-11-13  123

DF2:

   Year  BeginDate    EndDate   ID
0  2020 2020-01-15 2020-02-01  123
1  2020 2020-01-01 2020-01-02  123
2  2020 2020-03-01 2020-05-01  978
3  2020 2020-09-21 2020-10-01  978

DF1 result:

   Year       Date   ID  condition
0  2020 2020-01-01  123       True
1  2020 2020-01-01  123       True
2  2020 2020-01-01  345      False
3  2020 2020-01-10  123      False
4  2020 2020-11-13  123      False

Output - Set 2: DF1:

   Year       Date        ID
0  2019 2019-01-01   s904112
1  2019 2019-01-01   s911243
2  2019 2019-01-01   s917131
3  2019 2019-01-01  sp986214
4  2019 2019-01-01   s510006
5  2020 2020-01-10   s540006

DF2:

   Year  BeginDate    EndDate       ID
0  2020 2020-01-27 2020-09-02  s904112
1  2020 2020-01-27 2020-09-02  s904112
2  2020 2020-01-03 2020-03-15  s904112
3  2020 2020-04-15 2020-01-05  s904112
4  2020 2020-01-05 2020-05-15  s540006
5  2019 2019-01-05 2019-05-15  s904112

DF1 Result:

   Year       Date        ID  condition
0  2019 2019-01-01   s904112      False
1  2019 2019-01-01   s911243      False
2  2019 2019-01-01   s917131      False
3  2019 2019-01-01  sp986214      False
4  2019 2019-01-01   s510006      False
5  2020 2020-01-10   s540006       True

Upvotes: 6

Related Questions