Reputation: 323
I have a reference dataframe called ref that looks like this:
import numpy as np
import pandas as pd

ref = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                    'col2': [100, 100, 100, 300],
                    'col3': [200, 200, 500, 400],
                    'col4': [300, np.nan, 600, 600],
                    'col5': [400, np.nan, 700, 700]})
col1 col2 col3 col4 col5
0 a 100 200 300 400
1 b 100 200 NaN NaN
2 c 100 500 600 700
3 d 300 400 600 700
And I have another dataframe called df that looks like:
df = pd.DataFrame({'col2': [105, 100, 100, 200, 300],
                   'col3': [202, 200, 200, 300, 391],
                   'col4': [300, np.nan, 350, 400, 605],
                   'col5': [398, np.nan, 450, 500, 701]})
col2 col3 col4 col5
0 105 202 300 398
1 100 200 NaN NaN
2 100 200 350 450
3 200 300 400 500
4 300 391 605 701
What I am trying to do is compare each row of df with the rows of ref in order to determine whether it corresponds to a, b, c, d or None. And because the values in df can be slightly different from the ones in ref, I would like to allow some 'error', so that if the difference between two values is not greater than 10, we allow a match.
In the end, I would like to obtain something like:
col2 col3 col4 col5 id
0 105 202 300 398 a
1 100 200 NaN NaN b
2 100 200 350 450 none
3 200 300 400 500 none
4 300 391 605 701 d
The way I see it, the idea is to start with the values in df['col2'], check whether they match any rows in ref['col2'] and, for the ones that match, check whether df['col3'] also matches ref['col3'], and so on. Once we have identified (or not) the single matching row in ref, return the identifier written in ref['col1'].
I would need something "efficient" if possible, as my original dataset can have up to 135 columns. But I assume that, as I stop considering the rows in ref that don't match, it becomes easier to work on the ones that are left?
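In rough Python, the narrowing I have in mind might look something like this (just an untested sketch, with a hypothetical narrow_match helper):
def narrow_match(row, ref, tol=10):
    # start with every row of ref as a candidate
    candidates = ref
    for col in ['col2', 'col3', 'col4', 'col5']:
        # keep only the candidates whose value in this column is within tol
        candidates = candidates[abs(candidates[col] - row[col]) <= tol]
        if candidates.empty:
            return None
    # return the identifier of the (first) remaining candidate
    return candidates['col1'].iloc[0]

# NaN never satisfies the comparison above, so the row that should match 'b'
# would still need special handling
df['id'] = df.apply(narrow_match, args=(ref,), axis=1)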
I just started coding in Python and using pandas on my own, and I admit I am a bit lost. I'm not necessarily asking for the final code, but could anyone point me in the right direction?
I started coding this:
for read in df.itertuples():
    for row in ref.itertuples():
        if abs(read[1] - row[2]) <= 10:
            print(str(read[0]) + ' match ' + str(row[1]))
        else:
            print(str(read[0]) + ' match None')
But I didn't get very far.
First, I don't know how to "lock" the matching rows, so that I'm able to work exclusively on them for further comparisons.
Second, it returns "None" for every row that doesn't match, instead of returning a single line when nothing matches (which I assume is because I am using itertuples()).
Thanks for reading all this and thanks in advance for any tips !
Edit:
Here's the piece of code I have right now. I'm using np.isclose() for comparisons with a tolerance, but I'm sure there's a way to avoid writing an 'if' condition for each comparison (especially with 135 columns in my final DataFrame).
for read in df.itertuples():
    for row in ref.itertuples():
        if np.isclose(read[1], row[2], atol=10, equal_nan=True):
            if np.isclose(read[2], row[3], atol=10, equal_nan=True):
                if np.isclose(read[3], row[4], atol=10, equal_nan=True):
                    if np.isclose(read[4], row[5], atol=10, equal_nan=True):
                        print(str(read[0]) + ' match ' + str(row[1]))
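For instance, I imagine the nested ifs could collapse into a single check, something like the untested sketch below, but I don't know if this is the right approach:
for read in df.itertuples():
    for row in ref.itertuples():
        # read[1:] are the df values; row[1] is col1, row[2:] are the ref values
        if all(np.isclose(read[i], row[i + 1], atol=10, equal_nan=True)
               for i in range(1, 5)):
            print(str(read[0]) + ' match ' + str(row[1]))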
Upvotes: 1
Views: 1085
Reputation: 29635
Here is a way that is easy to set up with only 5 columns; I'm not sure about 135, but it might give you an idea.
First replace the NaN by 0 (or any value far from all the values you have), then create two columns "sup" and "inf" for each colX in ref, based on your error tolerance:
df = df.fillna(0)
ref = ref.fillna(0)

for nb in range(2, 6):
    ref['col' + str(nb) + 'sup'] = ref['col' + str(nb)] + 10
    ref['col' + str(nb) + 'inf'] = ref['col' + str(nb)] - 10
Now define a function find_match
to apply on each row of df
which will select the row of ref
matching:
def find_match(row, ref):
    row_ref = ref[(ref['col2sup'] >= row['col2']) & (ref['col2inf'] <= row['col2']) &
                  (ref['col3sup'] >= row['col3']) & (ref['col3inf'] <= row['col3']) &
                  (ref['col4sup'] >= row['col4']) & (ref['col4inf'] <= row['col4']) &
                  (ref['col5sup'] >= row['col5']) & (ref['col5inf'] <= row['col5'])]
    if not row_ref.empty:
        return row_ref['col1'].values[0]
    else:
        return None
Here it might be difficult to generalize to 135 columns.
Now you can create your column id in df with:
df['id'] = df.apply(find_match, args=(ref,), axis=1)
and you have your expected output.
EDIT: to go up to N columns, you can replace the row_ref line in find_match with:
row_ref = ref.copy()
for i in range(2, 6):
    row_ref = row_ref[(ref['col' + str(i) + 'sup'] >= row['col' + str(i)]) &
                      (ref['col' + str(i) + 'inf'] <= row['col' + str(i)])]
and change the 6 to 136 for example. But I'm not sure about the efficiency of this method on large data.
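If this turns out to be too slow, another option (just a sketch; it assumes the value columns are all numeric and that you select only the original colX columns, not the sup/inf helpers) is to compare all rows at once with numpy broadcasting instead of apply:
import numpy as np

cols = ['col2', 'col3', 'col4', 'col5']
df_vals = df[cols].to_numpy(dtype=float)    # shape (len(df), 4)
ref_vals = ref[cols].to_numpy(dtype=float)  # shape (len(ref), 4)

# broadcast to shape (len(df), len(ref), 4), then require every column to match
close = np.isclose(df_vals[:, None, :], ref_vals[None, :, :],
                   atol=10, equal_nan=True).all(axis=2)

# for each row of df, take the first matching row of ref, or None
match_idx = close.argmax(axis=1)
df['id'] = np.where(close.any(axis=1), ref['col1'].to_numpy()[match_idx], None)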
EDIT2: so with np.isclose, you can do something like:
# definition of the function, change the 4 to the right number of columns
def find_match(row, df_ref):
    # assumes df_ref is the plain ref, without the extra sup/inf columns
    ser = df_ref[np.isclose(row.iloc[:4], df_ref.iloc[:, 1:],
                            atol=10).all(axis=1)]['col1']
    if not ser.empty:
        return ser.values[0]
    else:
        return None
# and then apply
df['id'] = df.apply(find_match, args=(ref,), axis=1)
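Given the fillna(0) step from the beginning and atol=10, this should reproduce the expected id column on your sample data: a, b, None, None, d.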
EDIT3: to avoid comparing the whole row when it contains NaN:
def find_match(row, df_ref):
    # columns of the row that contain NaN
    col_nan = row[row.isnull()].index
    if not col_nan.empty:
        # compare only up to (and including) the first NaN column
        ind_col = row.index.get_loc(col_nan[0]) + 1
    else:
        ind_col = 4  # number of columns to compare if there is no NaN in the row
    ser = df_ref[np.isclose(row.iloc[:ind_col], df_ref.iloc[:, 1:ind_col + 1],
                            atol=10, equal_nan=True).all(axis=1)]['col1']
    if not ser.empty:
        return ser.values[0]
    else:
        return None
Note: you may find another way to get the ind_col, but this should work. Let me know if it's really faster (it depends on how many rows have NaN and how many NaNs they contain, I guess).
Upvotes: 1