Reputation: 3594
Is there a way to check whether any part of a string matches another string in Python?
For example, I have URLs which look like this:
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
and I have strings which look like:
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)
I would like to match string with url: Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA and www.ulta.com/beautyservices/benefitbrowbar/ with Benefit Cosmetics.
I've been trying url['urls'].str.contains('('+string+')', case = False), but this does not match.
What's the correct way to do this?
Upvotes: 0
Views: 149
Reputation: 9430
I can't do it as a regex in one line but here is my attempt using itertools and any:
import pandas as pd
from itertools import product

url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

# For each pair in the Cartesian product (the different combinations) of
# string_list and urls.
for x in product(string_list, url['urls']):
    # If any of the words in the string (x[0]) are present in
    # the URL (x[1]), disregarding case.
    if any(word.lower() in x[1].lower() for word in x[0].split()):
        # Show the match.
        print("Match String: %s URL: %s" % (x[0], x[1]))
Outputs:
Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
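As a side note, here is a small sketch of why the pattern from the question finds nothing. It assumes plain `re` instead of pandas, and the names `phrase_pattern`, `word_pattern`, `phrase_hits` and `word_hits` are mine, just for illustration. The joined pattern looks for each whole phrase, spaces included, and neither URL contains a phrase in that form; splitting the phrases into words is what makes a match possible.

```python
import re

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]

# The pattern from the question: each alternative is a whole phrase,
# so the URL would have to contain e.g. "Benefit Cosmetics" with the space.
phrase_pattern = re.compile('(' + '|'.join(string_list) + ')', re.IGNORECASE)

# A pattern built from the individual words instead (escaped so that any
# regex metacharacters in a word are treated literally).
words = [re.escape(w) for s in string_list for w in s.split()]
word_pattern = re.compile('|'.join(words), re.IGNORECASE)

phrase_hits = [bool(phrase_pattern.search(u)) for u in urls]
word_hits = [bool(word_pattern.search(u)) for u in urls]

print(phrase_hits)  # [False, False]
print(word_hits)    # [True, True]
```

Note that the word pattern only says that some word matched, not which string it came from, which is why the code above checks one string at a time.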
Updated:
Closer to the way you were approaching it, you could alternatively use:
import pandas as pd
import warnings

pd.set_option('display.width', 100)

# Suppress the warning it will give on a match.
warnings.filterwarnings("ignore", 'This pattern has match groups')

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})

# Using one string at a time.
for string in string_list:
    # Get the individual words in the string and concatenate them
    # using a pipe to create a regex pattern.
    s = "|".join(string.split())
    # Update the DataFrame with True or False where the regex
    # matches the URL.
    url[string] = url['urls'].str.contains('('+s+')', case = False)

# Show the result.
print(url)
which would output:
                                                urls  Benefit Cosmetics  Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...              False                     True
1        www.ulta.com/beautyservices/benefitbrowbar/               True                    False
This may be better if you want the result in a DataFrame, but I prefer the first way.
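If you would rather have one result per URL than one True/False column per string, something along these lines might work. It is only a sketch: the helper `first_matching_name` and the `matches` dictionary are hypothetical names I made up, and it uses plain `re` with the same any-word-matches rule as above. The words are escaped and wrapped in a non-capturing group, so there is no match group in the pattern.

```python
import re

def first_matching_name(url, names):
    """Return the first name with at least one word found in url, else None."""
    for name in names:
        # Non-capturing alternation of the escaped words in this name.
        pattern = '(?:' + '|'.join(re.escape(w) for w in name.split()) + ')'
        if re.search(pattern, url, flags=re.IGNORECASE):
            return name
    return None

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]

# Map each URL to the first string that shares a word with it.
matches = {u: first_matching_name(u, string_list) for u in urls}
for u, name in matches.items():
    print(u, '->', name)
```

With the DataFrame from above, the same helper could presumably be applied per row, e.g. url['urls'].apply(lambda u: first_matching_name(u, string_list)). Bear in mind that matching on single words can give false positives for common words, so this is only as reliable as the words in your strings are distinctive.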
Upvotes: 1