JacobOV
JacobOV

Reputation: 25

Check if there is a substring that matches a string from a list

Bit of a beginner question here; I currently have a pandas df with one column containing various different strings. I have some more columns which are currently empty. Example of first few rows below;

Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,

I've also created a set of lists as follows;

Cost = ['Cost']
Productivity = ['Productivity']
Security = ['Security','CCTV','Camera']

Essentially what I want to do is I want to go through each column and check whether the string in the "Risk" column on the same row contains a substring that matches one of the strings in the list. The ideal output would be as follows;

Risk,Cost,Productivity,Security
"Unforeseen cost due to security issues",TRUE,FALSE,TRUE
"Unexpected drop in Productivity",FALSE,TRUE,FALSE

I've tried a few different methods so far, such as

any(Cost in Risk for Cost in Costs)

However, I'm not sure if there's a way to avoid the any() function being case sensitive, and I'm not sure how to apply this to a whole column.. I did try

df['Cost'] = any(Cost in df['Risk'] for Cost in Costs)

but that returned a column full of "FALSE". Any nudge in the right direction would be hugely appreciated! Thank you

Upvotes: 2

Views: 102

Answers (3)

Nick W
Nick W

Reputation: 146

I would make everything lowercase to get all matches regardsless of case, and then turn both the sentence and the words to check into sets, then check if there are any matches:

from io import StringIO

txt = '''Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,'''

df = pd.read_csv(
    StringIO(txt),
    sep=',',
    index_col=None,
    header=0
)
df['Risk'] = df['Risk'].str.lower()
df.columns = [item.lower() for item in df.columns]
print(df)

key_dict = {
    'cost': set([item.lower() for item in ['Cost']]),
    'productivity': set([item.lower() for item in ['Productivity']]),
    'security': set([item.lower() for item in ['Security','CCTV','Camera']])
}

for idx in df.index:
    word_set = set(df.loc[idx, 'risk'].split())
    print(word_set)

    for col in key_dict:
        if len(word_set & key_dict[col]) > 0:
            df.loc[idx, col] = True
        else:
            df.loc[idx, col] = False
                                   risk   cost productivity security
0  unforeseen cost due to cctv failures   True        False     True
1       unexpected drop in productivity  False         True    False

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71689

We can create a regex pattern corresponding to each of the list Cost, Security and Productivity, then using str.contains test for the occurrences of each of the regex pattern in the strings of column Risk

for c in ('Cost', 'Productivity', 'Security'):
    df[c] = df['Risk'].str.contains(fr"(?i)\b(?:{'|'.join(locals()[c])})\b")

                                   Risk   Cost  Productivity  Security
0  Unforeseen cost due to CCTV failures   True         False      True
1       Unexpected drop in Productivity  False          True     False

Upvotes: 3

Anurag Dabas
Anurag Dabas

Reputation: 24304

Firstly create/define a function:

def check():
    res=[]
    for x in Search:
        res.append(df['Risk'].str.split(' ',expand=True).isin(x).any(1))
    return pd.DataFrame(res).T

Finally:

df[['Cost','Productivity','Security']]=check()

Output of df:

    Risk                                    Cost    Productivity    Security
0   Unforeseen cost due to CCTV failures    False   False           True
1   Unexpected drop in Productivity         False   True            False

Upvotes: 2

Related Questions