Ben
Ben

Reputation: 11

Matching string from one dataframe which is a partial string in the 2nd dataframe

I have two dataframes and I'm trying to find if a partial string (filename) from one dataframe g exists in a full string (full filename) in dataframe d and update a match column in the original dataframe g.

g = pd.DataFrame([['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME'])

d = pd.DataFrame([['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\zthatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])

e.g. I'm essentially trying to do a look up of d.FILENAME in g.FULL_FILE

I've tried d[d.FILENAME.isin(g.FULL_FILE)] but this gives an error, I assume because it's looking for a full match

I've tried the below but that seems to match more than all of g.FILENAME in d.FULL_FILE. g = g.merge(d, left_on = g.FILENAME.str.extract('(\d+)', expand = False), right_on = d.FULL_FILE.str.extract('(\d+)', expand = False), how = 'inner')

The overall aim is to:- 1. Match the g.FULL_FILE against d.FULL_FILE column.
2. If no match, match g.FILENAME if it exists as partial string match in d.FULL_FILE column 3. If no match still, check last 10 chars for match from g.FILENAME column in d.FULL_FILE column (incase there are special characters in the g.FULL_FILE)

Please help. I've done hours of research and can find solutions for some similar solutions but not exactly this and having trouble adapting those to this problem.

Upvotes: 0

Views: 87

Answers (1)

ely
ely

Reputation: 77404

I am not able to fully understand your desired outcome, but here is my best shot at the first two steps you mention. (I don't understand your comment about checking the final 10 chars or how it relates to special characters, so I am skipping that part.)

In [80]: g
Out[80]:
              FULL_FILE          FILENAME
0      c:\ythisFile.pdf      thisFile.pdf
1      c:\ythatFile.exe      thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip

In [81]: d
Out[81]:
              FULL_FILE       FILENAME
0     c:\zthis_File.pdf  this_File.pdf
1      c:\zthatFile.exe   thatFile.exe
2  c:\ztheOtherFile.zip       ssss.zip

In [82]: temp1 = pd.merge(
    g, 
    d, 
    on='FULL_FILE', 
    how='left', 
    suffixes=('_g', '_d')
)

In [83]: temp1
Out[83]:
              FULL_FILE        FILENAME_g FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf        NaN
1      c:\ythatFile.exe      thatFile.exe        NaN
2  c:\ytheOtherFile.zip  theOtherFile.zip        NaN

In [84]: step2 = d.FULL_FILE.map(
    lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
        lambda y: y in x
    ).any()
)

In [85]: step2
Out[85]:
0    False
1     True
2     True
Name: FULL_FILE, dtype: bool

In [86]: temp2 = pandas.merge(
    temp1, 
    d.loc[step2].drop('FULL_FILE', axis=1), 
    left_index=True, 
    right_index=True, 
    how='left'
)

In [87]: temp2
Out[87]:
              FULL_FILE        FILENAME_g FILENAME_d      FILENAME
0      c:\ythisFile.pdf      thisFile.pdf        NaN           NaN
1      c:\ythatFile.exe      thatFile.exe        NaN  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip        NaN      ssss.zip                  

In [88]: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)

In [89]:temp2.drop('FILENAME', axis=1)
Out[89]:
              FULL_FILE        FILENAME_g    FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf           NaN
1      c:\ythatFile.exe      thatFile.exe  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip

Note that this also works for the case when there is actually a match in the first step. For example, if I add such an entry to your example data:

In [135]: def fuzzy_match(g, d):
     ...:     temp1 = pd.merge(
     ...:         g,
     ...:         d,
     ...:         on='FULL_FILE',
     ...:         how='left',
     ...:         suffixes=('_g', '_d')
     ...:     )
     ...:     step2 = d.FULL_FILE.map(
     ...:         lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
     ...:             lambda y: y in x
     ...:         ).any()
     ...:     )
     ...:     temp2 = pd.merge(
     ...:         temp1,
     ...:         d.loc[step2].drop('FULL_FILE', axis=1),
     ...:         left_index=True,
     ...:         right_index=True,
     ...:         how='left'
     ...:     )
     ...:     temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
     ...:     return temp2.drop('FILENAME', axis=1)
     ...:
     ...:


In [136]: g = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', '
     ...: theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME']); d = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\z
     ...: thatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])

In [137]: g
Out[137]:
              FULL_FILE          FILENAME
0          c:\aFile.txt         aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf
2      c:\ythatFile.exe      thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip

In [138]: d
Out[138]:
              FULL_FILE       FILENAME
0          c:\aFile.txt      aFile.txt
1     c:\zthis_File.pdf  this_File.pdf
2      c:\zthatFile.exe   thatFile.exe
3  c:\ztheOtherFile.zip       ssss.zip

In [139]: fuzzy_match(g, d)
Out[139]:
              FULL_FILE        FILENAME_g    FILENAME_d
0          c:\aFile.txt         aFile.txt     aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf           NaN
2      c:\ythatFile.exe      thatFile.exe  thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip 

Upvotes: 1

Related Questions