Reputation: 323
I have a reference dataframe called ref that looks like this:
import numpy as np
import pandas as pd

ref = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                    'col2': [100, 100, 100, 300],
                    'col3': [200, 200, 500, 400],
                    'col4': [300, np.nan, 600, 600],
                    'col5': [400, np.nan, 700, 700]})
col1 col2 col3 col4 col5
0 a 100 200 300 400
1 b 100 200 NaN NaN
2 c 100 500 600 700
3 d 300 400 600 700
And I have another dataframe called df that looks like:
df = pd.DataFrame({'col2': [105, 100, 100, 200, 300],
                   'col3': [202, 200, 200, 300, 391],
                   'col4': [300, np.nan, 350, 400, 605],
                   'col5': [398, np.nan, 450, 500, 701]})
col2 col3 col4 col5
0 105 202 300 398
1 100 200 NaN NaN
2 100 200 350 450
3 200 300 400 500
4 300 391 605 701
What I am trying to do is compare each row of df with the rows of ref in order to determine whether it corresponds to a, b, c, d or None. And because the values in df can be slightly different from the ones in ref, I would like to allow some 'error', so that if the difference between two values is not greater than 10, we allow a match.
In the end, I would like to obtain something like:
col2 col3 col4 col5 id
0 105 202 300 398 a
1 100 200 NaN NaN b
2 100 200 350 450 none
3 200 300 400 500 none
4 300 391 605 701 d
The way I see it, the idea is to start with the values in df['col2'], check whether they match any rows in ref['col2'] and, for the ones that match, check whether df['col3'] also matches ref['col3'], and so on. Once we have identified (or not) the single matching row in ref, return the identifier written in ref['col1'].
I would need something "efficient" if possible, as my original dataset can have up to 135 columns. But I assume that, as I stop considering the rows in ref that don't match, it becomes easier to work on the ones that are left?
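In rough Python, the narrowing I have in mind might look something like this (just an untested sketch, with a hypothetical narrow_match helper):
def narrow_match(row, ref, tol=10):
    # start with every row of ref as a candidate
    candidates = ref
    for col in ['col2', 'col3', 'col4', 'col5']:
        # keep only the candidates whose value in this column is within tol
        candidates = candidates[abs(candidates[col] - row[col]) <= tol]
        if candidates.empty:
            return None
    # return the identifier of the (first) remaining candidate
    return candidates['col1'].iloc[0]

# NaN never satisfies the comparison above, so the row that should match 'b'
# would still need special handling
df['id'] = df.apply(narrow_match, args=(ref,), axis=1)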
I just started coding in Python and using pandas on my own, and I admit I am a bit lost. I'm not necessarily asking for the final code, but could anyone point me in the right direction?
I started coding this:
for read in df.itertuples():
    for row in ref.itertuples():
        if abs(read[1] - row[2]) <= 10:
            print(str(read[0]) + ' match ' + str(row[1]))
        else:
            print(str(read[0]) + ' match None')
But I didn't get very far.
First, I don't know how to "lock" the matching rows, so that I'm able to work exclusively on them for further comparisons.
Second, it returns "None" for every row that doesn't match, instead of returning a single line when nothing matches (which I assume is because I am using itertuples()).
Thanks for reading all this and thanks in advance for any tips !
Edit:
Here's the piece of code I have right now. I'm using np.isclose() for comparisons with a tolerance, but I'm sure there's a way to avoid writing an 'if' condition for each comparison (especially with 135 columns in my final DataFrame).
for read in df.itertuples():
    for row in ref.itertuples():
        if np.isclose(read[1], row[2], atol=10, equal_nan=True):
            if np.isclose(read[2], row[3], atol=10, equal_nan=True):
                if np.isclose(read[3], row[4], atol=10, equal_nan=True):
                    if np.isclose(read[4], row[5], atol=10, equal_nan=True):
                        print(str(read[0]) + ' match ' + str(row[1]))
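For instance, I imagine the nested ifs could collapse into a single check, something like the untested sketch below, but I don't know if this is the right approach:
for read in df.itertuples():
    for row in ref.itertuples():
        # read[1:] are the df values; row[1] is col1, row[2:] are the ref values
        if all(np.isclose(read[i], row[i + 1], atol=10, equal_nan=True)
               for i in range(1, 5)):
            print(str(read[0]) + ' match ' + str(row[1]))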
Upvotes: 1
Views: 1085
Reputation: 29635
Here is a way that is easy to set up with only 5 columns; I'm not sure about 135, but it might give you an idea.
First replace the NaN by 0 (or any value far from all the values you have), then create two columns "sup" and "inf" for each colX in ref, based on your error tolerance:
df = df.fillna(0)
ref = ref.fillna(0)

for nb in range(2, 6):
    ref['col' + str(nb) + 'sup'] = ref['col' + str(nb)] + 10
    ref['col' + str(nb) + 'inf'] = ref['col' + str(nb)] - 10
Now define a function find_match
to apply on each row of df
which will select the row of ref
matching:
def find_match(row, ref):
    row_ref = ref[(ref['col2sup'] >= row['col2']) & (ref['col2inf'] <= row['col2']) &
                  (ref['col3sup'] >= row['col3']) & (ref['col3inf'] <= row['col3']) &
                  (ref['col4sup'] >= row['col4']) & (ref['col4inf'] <= row['col4']) &
                  (ref['col5sup'] >= row['col5']) & (ref['col5inf'] <= row['col5'])]
    if not row_ref.empty:
        return row_ref['col1'].values[0]
    else:
        return None
Here it might be difficult to generalize to 135 columns.
Now you can create your column id in df with:
df['id'] = df.apply(find_match, args=(ref,), axis=1)
and you have your expected output.
EDIT: to go up to N columns, you can replace the row_ref line in find_match with:
row_ref = ref.copy()
for i in range(2, 6):
    row_ref = row_ref[(ref['col' + str(i) + 'sup'] >= row['col' + str(i)]) &
                      (ref['col' + str(i) + 'inf'] <= row['col' + str(i)])]
and change the 6 to 136 for example. But I'm not sure about the efficiency of this method on large data.
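If this turns out to be too slow, another option (just a sketch; it assumes the value columns are all numeric and that you select only the original colX columns, not the sup/inf helpers) is to compare all rows at once with numpy broadcasting instead of apply:
import numpy as np

cols = ['col2', 'col3', 'col4', 'col5']
df_vals = df[cols].to_numpy(dtype=float)    # shape (len(df), 4)
ref_vals = ref[cols].to_numpy(dtype=float)  # shape (len(ref), 4)

# broadcast to shape (len(df), len(ref), 4), then require every column to match
close = np.isclose(df_vals[:, None, :], ref_vals[None, :, :],
                   atol=10, equal_nan=True).all(axis=2)

# for each row of df, take the first matching row of ref, or None
match_idx = close.argmax(axis=1)
df['id'] = np.where(close.any(axis=1), ref['col1'].to_numpy()[match_idx], None)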
EDIT2: so with np.isclose, you can do something like:
# definition of the function, change the 4 to the right number of columns
def find_match(row, df_ref):
    # assumes df_ref is the plain ref, without the extra sup/inf columns
    ser = df_ref[np.isclose(row.iloc[:4], df_ref.iloc[:, 1:],
                            atol=10).all(axis=1)]['col1']
    if not ser.empty:
        return ser.values[0]
    else:
        return None
# and then apply
df['id'] = df.apply(find_match, args=(ref,), axis=1)
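Given the fillna(0) step from the beginning and atol=10, this should reproduce the expected id column on your sample data: a, b, None, None, d.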
EDIT3: to avoid comparing the whole row when it contains NaN:
def find_match(row, df_ref):
    # columns of the row that contain NaN
    col_nan = row[row.isnull()].index
    if not col_nan.empty:
        # compare only up to (and including) the first NaN column
        ind_col = row.index.get_loc(col_nan[0]) + 1
    else:
        ind_col = 4  # number of columns to compare if there is no NaN in the row
    ser = df_ref[np.isclose(row.iloc[:ind_col], df_ref.iloc[:, 1:ind_col + 1],
                            atol=10, equal_nan=True).all(axis=1)]['col1']
    if not ser.empty:
        return ser.values[0]
    else:
        return None
Note: you may find another way to get the ind_col, but this should work. Let me know if it's really faster (it depends on how many rows have NaN and how many NaNs they contain, I guess).
Upvotes: 1