Andy
Andy

Reputation: 3170

Partial String Matching between Pandas DataFrames

I've seen discussions on partial string match merges here, here, and elsewhere, but nothing on how to do it on the table containing the superstring instead of the substring.

Given sample data like

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index', 
                            'https://bar.net/directory', 
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})
print(df1)

                         uri
0   http://www.foo.com/index
1  https://bar.net/directory
2        www.baz.gov/aboutus
print(df2)

          fqdn
0  www.foo.com
1  www.qux.mil

my end goal is to remove the rows from df1 that contain any substring in df2. In the actual data, df1 has several million rows and df2 has a few hundred, and any given row in df1 will have at most one substring in df2.

Given the sample data, I hope to end up with a dataframe like

                         uri
0  https://bar.net/directory
1        www.baz.gov/aboutus

and by my logic, the intermediate step is to generate

                         uri           fqdn
0   http://www.foo.com/index    www.foo.com
1  https://bar.net/directory         np.NaN
2        www.baz.gov/aboutus         np.NaN

but I can't figure out how to check all of df2's values inside of df1.apply().


Edit:

While both answers below work, I got the fastest results in my specific scenario by compiling a regex object and using extract:

import re

fqdn_list= re.compile(f"({'|'.join(df2.fqdn)})")

df1['fqdn'] = df1.uri.str.extract(fqdn_list)

Upvotes: 1

Views: 277

Answers (2)

Erfan
Erfan

Reputation: 42886

Concatenate the strings from df2 with the pipe (|) as seperator which is the or operator in regex. This way we can check if the df1 contains any of these strings and remove them with str.contains and ~ which is the not operator;

m = ~df1['uri'].str.contains('|'.join(df2['fqdn']))
df1[m]

Output

                         uri
1  https://bar.net/directory
2        www.baz.gov/aboutus

Upvotes: 1

BENY
BENY

Reputation: 323226

Is this what you need ? str.findall

df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
Out[192]: 
0    www.foo.com
1            NaN
2            NaN
Name: uri, dtype: object
#df1['fqdn']=df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]

Upvotes: 2

Related Questions