Reputation: 25
Bit of a beginner question here; I currently have a pandas df with one column containing various different strings, and some more columns which are currently empty. Example of the first few rows below:
Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,
I've also created a set of lists as follows:
Cost = ['Cost']
Productivity = ['Productivity']
Security = ['Security','CCTV','Camera']
Essentially, I want to go through each column and check whether the string in the "Risk" column on the same row contains a substring that matches one of the strings in that column's list. The ideal output would be as follows:
Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",TRUE,FALSE,TRUE
"Unexpected drop in Productivity",FALSE,TRUE,FALSE
I've tried a few different methods so far, such as
any(Cost in Risk for Cost in Costs)
However, I'm not sure if there's a way to stop this check being case sensitive, and I'm not sure how to apply it to a whole column. I did try
df['Cost'] = any(Cost in df['Risk'] for Cost in Costs)
but that returned a column full of "FALSE". Any nudge in the right direction would be hugely appreciated! Thank you
Upvotes: 2
Views: 102
Reputation: 146
I would make everything lowercase to get all matches regardless of case, then turn both the sentence and the words to check into sets and test whether they intersect:
import pandas as pd
from io import StringIO

txt = '''Risk,Cost,Productivity,Security
"Unforeseen cost due to CCTV failures",,,
"Unexpected drop in Productivity",,,'''

df = pd.read_csv(
    StringIO(txt),
    sep=',',
    index_col=None,
    header=0
)
df['Risk'] = df['Risk'].str.lower()
df.columns = [item.lower() for item in df.columns]
print(df)
key_dict = {
    'cost': set([item.lower() for item in ['Cost']]),
    'productivity': set([item.lower() for item in ['Productivity']]),
    'security': set([item.lower() for item in ['Security', 'CCTV', 'Camera']])
}

for idx in df.index:
    word_set = set(df.loc[idx, 'risk'].split())
    print(word_set)
    for col in key_dict:
        if len(word_set & key_dict[col]) > 0:
            df.loc[idx, col] = True
        else:
            df.loc[idx, col] = False
risk cost productivity security
0 unforeseen cost due to cctv failures True False True
1 unexpected drop in productivity False True False
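The per-row loop can also be written with `apply`, which avoids the repeated `df.loc` assignments; a minimal sketch of the same set-intersection idea (column and keyword names as above):

```python
import pandas as pd

df = pd.DataFrame({'risk': ['unforeseen cost due to cctv failures',
                            'unexpected drop in productivity']})
key_dict = {
    'cost': {'cost'},
    'productivity': {'productivity'},
    'security': {'security', 'cctv', 'camera'},
}
# For each column, flag rows whose word set intersects the keyword set
for col, keys in key_dict.items():
    df[col] = df['risk'].apply(lambda s, k=keys: bool(set(s.split()) & k))
print(df)
```

Note this still only matches whole words, exactly as the loop version does.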
Upvotes: 2
Reputation: 71689
We can create a regex pattern corresponding to each of the lists Cost, Security and Productivity, then use str.contains to test for occurrences of each pattern in the strings of column Risk:
for c in ('Cost', 'Productivity', 'Security'):
    df[c] = df['Risk'].str.contains(fr"(?i)\b(?:{'|'.join(locals()[c])})\b")
Risk Cost Productivity Security
0 Unforeseen cost due to CCTV failures True False True
1 Unexpected drop in Productivity False True False
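If you'd rather not reach into `locals()`, the same `str.contains` approach works with an explicit mapping from column names to keyword lists; a sketch reusing the lists from the question:

```python
import pandas as pd

df = pd.DataFrame({'Risk': ['Unforeseen cost due to CCTV failures',
                            'Unexpected drop in Productivity']})
keywords = {
    'Cost': ['Cost'],
    'Productivity': ['Productivity'],
    'Security': ['Security', 'CCTV', 'Camera'],
}
for col, words in keywords.items():
    # (?i) makes the match case-insensitive; \b restricts it to whole words;
    # (?:...) keeps the alternation non-capturing so pandas emits no warning
    df[col] = df['Risk'].str.contains(fr"(?i)\b(?:{'|'.join(words)})\b")
print(df)
```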
Upvotes: 3
Reputation: 24304
Firstly create/define a function (with Search being the list of keyword lists, one per output column):

Search = [Cost, Productivity, Security]

def check():
    res = []
    for x in Search:
        res.append(df['Risk'].str.split(' ', expand=True).isin(x).any(axis=1))
    return pd.DataFrame(res).T
Finally:
df[['Cost','Productivity','Security']]=check()
Output of df:
Risk Cost Productivity Security
0 Unforeseen cost due to CCTV failures False False True
1 Unexpected drop in Productivity False True False
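Note the False under Cost in row 0: `isin` compares whole words case-sensitively, so "cost" does not match 'Cost'. Lower-casing both the column and the keyword lists first gives the output the question asked for; a sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({'Risk': ['Unforeseen cost due to CCTV failures',
                            'Unexpected drop in Productivity']})
Cost = ['Cost']
Productivity = ['Productivity']
Security = ['Security', 'CCTV', 'Camera']
Search = [Cost, Productivity, Security]

res = []
for words in Search:
    # Lower-case both sides so 'cost' matches 'Cost'
    lowered = [w.lower() for w in words]
    res.append(df['Risk'].str.lower()
                         .str.split(' ', expand=True)
                         .isin(lowered)
                         .any(axis=1))
out = pd.DataFrame(res).T
out.columns = ['Cost', 'Productivity', 'Security']
print(out)
```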
Upvotes: 2