Reputation: 3170
I've seen discussions on partial string match merges here, here, and elsewhere, but nothing on how to do it on the table containing the superstring instead of the substring.
Given sample data like
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com',
                             'www.qux.mil']})
print(df1)
uri
0 http://www.foo.com/index
1 https://bar.net/directory
2 www.baz.gov/aboutus
print(df2)
fqdn
0 www.foo.com
1 www.qux.mil
my end goal is to remove the rows from df1 that contain any substring in df2. In the actual data, df1 has several million rows and df2 has a few hundred, and any given row in df1 will have at most one substring in df2.
Given the sample data, I hope to end up with a dataframe like
uri
0 https://bar.net/directory
1 www.baz.gov/aboutus
and by my logic, the intermediate step is to generate
uri fqdn
0 http://www.foo.com/index www.foo.com
1 https://bar.net/directory np.NaN
2 www.baz.gov/aboutus np.NaN
but I can't figure out how to check all of df2's values inside of df1.apply().
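That intermediate table can be generated with str.extract rather than apply: build one alternation pattern from df2's values and capture whichever substring matches. A minimal sketch using the sample frames above (re.escape is added here so the dots in the fqdns are matched literally):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com', 'www.qux.mil']})

# One capture group alternating over the escaped substrings;
# rows with no match get NaN
pattern = '({})'.format('|'.join(map(re.escape, df2['fqdn'])))
df1['fqdn'] = df1['uri'].str.extract(pattern, expand=False)
print(df1)
```

With expand=False the single capture group comes back as a Series, which assigns cleanly to the new column.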
Edit:
While both answers below work, I got the fastest results in my specific scenario by compiling a regex object and using extract:
import re
# Escape each fqdn so '.' matches a literal dot, not any character
fqdn_pattern = re.compile('({})'.format('|'.join(map(re.escape, df2.fqdn))))
df1['fqdn'] = df1.uri.str.extract(fqdn_pattern)
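Once the fqdn column exists, the end goal is a matter of keeping the rows where nothing matched and dropping the helper column. A self-contained sketch with the sample data (variable names here are illustrative, not from the original post):

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com', 'www.qux.mil']})

# Compile an alternation of the escaped substrings and extract any match
fqdn_pattern = re.compile('({})'.format('|'.join(map(re.escape, df2['fqdn']))))
df1['fqdn'] = df1['uri'].str.extract(fqdn_pattern, expand=False)

# NaN in 'fqdn' means no substring matched; keep only those rows
result = df1[df1['fqdn'].isna()].drop(columns='fqdn').reset_index(drop=True)
print(result)
```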
Upvotes: 1
Views: 277
Reputation: 42886
Concatenate the strings from df2 with the pipe (|) as separator, which is the or operator in regex. This way we can check whether df1 contains any of these strings and remove them with str.contains and ~, which is the not operator:
m = ~df1['uri'].str.contains('|'.join(df2['fqdn']))
df1[m]
Output
uri
1 https://bar.net/directory
2 www.baz.gov/aboutus
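One caveat worth noting as an addition to this answer: the dots in the fqdns are regex wildcards under str.contains, so escaping each substring with re.escape makes the match literal. A sketch with the sample data:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com', 'www.qux.mil']})

# Escape each substring so '.' is matched literally before joining with '|'
pattern = '|'.join(map(re.escape, df2['fqdn']))
m = ~df1['uri'].str.contains(pattern)
print(df1[m])
```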
Upvotes: 1
Reputation: 323226
Is this what you need? str.findall
df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
Out[192]:
0 www.foo.com
1 NaN
2 NaN
Name: uri, dtype: object
#df1['fqdn']=df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]
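A sketch of how the commented-out assignment feeds the asker's end goal, assuming the sample frames: rows where findall returns an empty list get NaN from .str[0], which marks them as the rows to keep.

```python
import pandas as pd

df1 = pd.DataFrame({'uri': ['http://www.foo.com/index',
                            'https://bar.net/directory',
                            'www.baz.gov/aboutus']})
df2 = pd.DataFrame({'fqdn': ['www.foo.com', 'www.qux.mil']})

# findall returns a list of matches per row; .str[0] takes the first (or NaN)
df1['fqdn'] = df1.uri.str.findall(df2.fqdn.str.cat(sep='|')).str[0]

# Rows with no match have NaN in 'fqdn'; keep only those
result = df1[df1['fqdn'].isna()][['uri']]
print(result)
```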
Upvotes: 2