HiAlllllll
HiAlllllll

Reputation: 63

Python find partial matches of two data frame columns of different lengths

I have two data frame columns of different lengths. DF1:

filename
b~cute
b~cute2
nan
b~cat
b~dog
nan

DF2:

filename
sgjsg~12345~b~cute~v4.jpeg
jgsgj~1233~b~dog~v4.jpeg
jhdjd~1252~b~cat~v4.jpeg
sggsn~1252~b~cute2~v4.jpeg

I am trying to do a partial match to create a separate column in DF1 with the filename as such

filename filepath
b~cute sgjsg~12345~b~cute~v4.jpeg
b~cute2 sggsn~1252~b~cute2~v4.jpeg
nan nan
b~cat jhdjd~1252~b~cat~v4.jpeg
b~dog jgsgj~1233~b~dog~v4.jpeg
nan nan

Upvotes: 2

Views: 68

Answers (2)

Corralien
Corralien

Reputation: 120559

You can use a regex:

import re

# Create pattern and escape regex
pat = (df1['filename'].dropna().sort_values(ascending=False)
                      .map(re.escape).str.cat(sep='|'))

match = df2['filename'].str.extract(f"({pat})", expand=False)
out = df1.merge(df2.rename(columns={'filename': 'filepath'})
                   .assign(filename=match), on='filename', how='left')

Output:

>>> out
  filename                    filepath
0   b~cute  sgjsg~12345~b~cute~v4.jpeg
1  b~cute2  sggsn~1252~b~cute2~v4.jpeg
2      NaN                         NaN
3    b~cat    jhdjd~1252~b~cat~v4.jpeg
4    b~dog    jgsgj~1233~b~dog~v4.jpeg
5      NaN                         NaN

Details:

>>> pat
'b\\~dog|b\\~cute2|b\\~cute|b\\~cat'

Upvotes: 1

iuvbio
iuvbio

Reputation: 649

Make a helper function and use it to create the column on DF1.

def match_fn(fn, filenames):
    if not isinstance(fn, str):
        return None
    for filename in filenames:
        if fn in filename:
            return filename
    return None

df1.loc[:, "filepath"] = df1.filename.apply(lambda fn: match_fn(fn, df2.filename.values))

Upvotes: 1

Related Questions