vagabond

Reputation: 3594

Check whether part of a string is contained in another string (regex, Python)

Is there a way to check whether any part of a string matches another string in Python?

For example, I have URLs that look like this:

url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})

and I have strings which look like:

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
string = '|'.join(string_list)

I would like to match each string with its URL:

Anastasia Beverly Hills with www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA, and

Benefit Cosmetics with www.ulta.com/beautyservices/benefitbrowbar/.

I've been trying url['urls'].str.contains('('+string+')', case = False), but this does not match.

What's the correct way to do this?

Upvotes: 0

Views: 149

Answers (1)

Dan-Dev

Reputation: 9430

I can't do it as a one-line regex, but here is my attempt using itertools.product and any:

import pandas as pd
from itertools import product

url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']

# For each pair in the Cartesian product (every combination) of
# string_list and the URLs.
for name, link in product(string_list, url['urls']):
    # Check whether any word in the string appears in the URL,
    # ignoring case.
    if any(word.lower() in link.lower() for word in name.split()):
        # Show the match.
        print("Match String: %s URL: %s" % (name, link))

Outputs:

Match String: Benefit Cosmetics URL: www.ulta.com/beautyservices/benefitbrowbar/
Match String: Anastasia Beverly Hills URL: www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA
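
As a side note on why the pattern in the question finds nothing: joining the full strings with a pipe produces an alternation of whole phrases, spaces included, and neither URL contains "Benefit Cosmetics" or "Anastasia Beverly Hills" verbatim. The sketch below compares that pattern with an alternation of the individual words, using only the standard re module. The names phrase_pattern, word_pattern, phrase_hits and word_hits are made up for illustration, and re.escape is an extra precaution that is not in the question.

```python
import re

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]

# Pattern built as in the question: an alternation of whole phrases.
phrase_pattern = re.compile('|'.join(string_list), re.IGNORECASE)

# Alternation of the individual words instead. re.escape guards against
# regex metacharacters that might appear in a name.
words = [word for name in string_list for word in name.split()]
word_pattern = re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)

phrase_hits = [bool(phrase_pattern.search(u)) for u in urls]
word_hits = [bool(word_pattern.search(u)) for u in urls]

print(phrase_hits)  # [False, False]
print(word_hits)    # [True, True]
```

The word-level pattern only tells you that some word matched, not which string it came from, which is why the code above loops over the strings one at a time.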

Updated:

Following the approach you were trying, you could alternatively use:

import pandas as pd

pd.set_option('display.width', 100)
string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
# Create a pandas DataFrame.
url = pd.DataFrame({'urls' : ['www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA', 'www.ulta.com/beautyservices/benefitbrowbar/']})
# Use one string at a time.
for string in string_list:
    # Split the string into words and join them with a pipe
    # to create a regex alternation pattern.
    s = "|".join(string.split())
    # Add a column that is True where the regex matches the URL.
    # The non-capturing group (?:...) avoids the pandas warning
    # about match groups.
    url[string] = url['urls'].str.contains('(?:' + s + ')', case=False)
# Show the result.
print(url)

which would output:

                                                urls Benefit Cosmetics Anastasia Beverly Hills
0  www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00...             False                    True
1        www.ulta.com/beautyservices/benefitbrowbar/              True                   False

This may be better if you want the result in a DataFrame, but I prefer the first way.
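
If what you want is a single label per URL rather than one boolean column per string, something along these lines might work. It is only a sketch under a few assumptions: build_patterns and first_match are hypothetical helper names, the first string in list order wins when several match, and matching on single words can give false positives for common words such as "Hills".

```python
import re

def build_patterns(names):
    """Map each name to a case-insensitive regex matching any of its words."""
    return {
        name: re.compile('|'.join(re.escape(w) for w in name.split()), re.IGNORECASE)
        for name in names
    }

def first_match(link, patterns):
    """Return the first name that has a word in link, or None if none do."""
    for name, pattern in patterns.items():
        if pattern.search(link):
            return name
    return None

string_list = ['Benefit Cosmetics', 'Anastasia Beverly Hills']
urls = [
    'www.amazon.com/ANASTASIA-Beverly...Brow/dp/B00GI21NZA',
    'www.ulta.com/beautyservices/benefitbrowbar/',
]

patterns = build_patterns(string_list)
matches = {link: first_match(link, patterns) for link in urls}

for link, name in matches.items():
    print("URL: %s Match String: %s" % (link, name))
```

With the DataFrame from above, the same helper could be applied as url['match'] = url['urls'].apply(lambda link: first_match(link, patterns)), where 'match' is just an illustrative column name.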

Upvotes: 1
