Reputation: 11
I have two dataframes and I'm trying to find if a partial string (filename) from one dataframe g exists in a full string (full filename) in dataframe d and update a match column in the original dataframe g.
g = pd.DataFrame([['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame([['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\zthatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])
e.g. I'm essentially trying to do a look up of d.FILENAME in g.FULL_FILE
I've tried d[d.FILENAME.isin(g.FULL_FILE)]
but this gives an error, I assume because it's looking for a full match
I've tried the below but that seems to match more than all of g.FILENAME in d.FULL_FILE. g = g.merge(d, left_on = g.FILENAME.str.extract('(\d+)', expand = False), right_on = d.FULL_FILE.str.extract('(\d+)', expand = False), how = 'inner')
The overall aim is to:-
1. Match the g.FULL_FILE against d.FULL_FILE column.
2. If no match, match g.FILENAME if it exists as partial string match in d.FULL_FILE column
3. If no match still, check last 10 chars for match from g.FILENAME column in d.FULL_FILE column (incase there are special characters in the g.FULL_FILE)
Please help. I've done hours of research and can find solutions for some similar solutions but not exactly this and having trouble adapting those to this problem.
Upvotes: 0
Views: 87
Reputation: 77404
I am not able to fully understand your desired outcome, but here is my best shot at the first two steps you mention. (I don't understand your comment about checking the final 10 chars or how it relates to special characters, so I am skipping that part.)
In [80]: g
Out[80]:
FULL_FILE FILENAME
0 c:\ythisFile.pdf thisFile.pdf
1 c:\ythatFile.exe thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip
In [81]: d
Out[81]:
FULL_FILE FILENAME
0 c:\zthis_File.pdf this_File.pdf
1 c:\zthatFile.exe thatFile.exe
2 c:\ztheOtherFile.zip ssss.zip
In [82]: temp1 = pd.merge(
g,
d,
on='FULL_FILE',
how='left',
suffixes=('_g', '_d')
)
In [83]: temp1
Out[83]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\ythisFile.pdf thisFile.pdf NaN
1 c:\ythatFile.exe thatFile.exe NaN
2 c:\ytheOtherFile.zip theOtherFile.zip NaN
In [84]: step2 = d.FULL_FILE.map(
lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
lambda y: y in x
).any()
)
In [85]: step2
Out[85]:
0 False
1 True
2 True
Name: FULL_FILE, dtype: bool
In [86]: temp2 = pandas.merge(
temp1,
d.loc[step2].drop('FULL_FILE', axis=1),
left_index=True,
right_index=True,
how='left'
)
In [87]: temp2
Out[87]:
FULL_FILE FILENAME_g FILENAME_d FILENAME
0 c:\ythisFile.pdf thisFile.pdf NaN NaN
1 c:\ythatFile.exe thatFile.exe NaN thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip NaN ssss.zip
In [88]: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
In [89]:temp2.drop('FILENAME', axis=1)
Out[89]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\ythisFile.pdf thisFile.pdf NaN
1 c:\ythatFile.exe thatFile.exe thatFile.exe
2 c:\ytheOtherFile.zip theOtherFile.zip ssss.zip
Note that this also works for the case when there is actually a match in the first step. For example, if I add such an entry to your example data:
In [135]: def fuzzy_match(g, d):
...: temp1 = pd.merge(
...: g,
...: d,
...: on='FULL_FILE',
...: how='left',
...: suffixes=('_g', '_d')
...: )
...: step2 = d.FULL_FILE.map(
...: lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
...: lambda y: y in x
...: ).any()
...: )
...: temp2 = pd.merge(
...: temp1,
...: d.loc[step2].drop('FULL_FILE', axis=1),
...: left_index=True,
...: right_index=True,
...: how='left'
...: )
...: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
...: return temp2.drop('FILENAME', axis=1)
...:
...:
In [136]: g = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', '
...: theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME']); d = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\z
...: thatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])
In [137]: g
Out[137]:
FULL_FILE FILENAME
0 c:\aFile.txt aFile.txt
1 c:\ythisFile.pdf thisFile.pdf
2 c:\ythatFile.exe thatFile.exe
3 c:\ytheOtherFile.zip theOtherFile.zip
In [138]: d
Out[138]:
FULL_FILE FILENAME
0 c:\aFile.txt aFile.txt
1 c:\zthis_File.pdf this_File.pdf
2 c:\zthatFile.exe thatFile.exe
3 c:\ztheOtherFile.zip ssss.zip
In [139]: fuzzy_match(g, d)
Out[139]:
FULL_FILE FILENAME_g FILENAME_d
0 c:\aFile.txt aFile.txt aFile.txt
1 c:\ythisFile.pdf thisFile.pdf NaN
2 c:\ythatFile.exe thatFile.exe thatFile.exe
3 c:\ytheOtherFile.zip theOtherFile.zip ssss.zip
Upvotes: 1