Reputation: 4291
I would like to search keywords from a dataframe column, called 'string'.
The keywords are contained in a dictionary.
For each key, the value is an array of several keywords.
My concern is that the speed is very low and it takes a lot of time.
Maybe there are many loops involved and df.str.contains
cannot be used.
How to speed up the process?
def match(string, keyword):
m = len(string)
n = len(keyword)
idx = string.find(keyword)
if idx == -1:
return 0
if len(re.findall('[a-zA-Z]', string[idx])) > 0:
if idx > 0:
if len(re.findall('[a-zA-Z]', string[idx - 1])) > 0:
return 0
if len(re.findall('[a-zA-Z]', string[idx+n-1])) > 0:
if idx + n < m:
if len(re.findall('[a-zA-Z]', string[idx + n])) > 0:
return 0
return 1
def match_keyword(df, keyword_dict, name):
df_new = pd.DataFrame()
for owner_id, keyword in keyword_dict.items():
try:
for index, data in df.iterrows():
a = [match(data['string'], word) for word in keyword]
t = int(np.sum(a))
if t > 0:
df_new.loc[index, name+'_'+str(owner_id)] = 1
else:
df_new.loc[index, name+'_'+str(owner_id)] = 0
except:
df_new[name+'_'+str(owner_id)] = 0
return df_new.astype(int)
Input:
String
0 New Beauty Company is now offering 超級discounts
1 Swimming is good for children and adults
2 Children love food though it may not be good
keywords:{'a':['New', 'is', '超級'], 'b':['Swim', 'discounts', 'good']}
Results:
'New' 'is' '超級' result(or relation)
0 1 1 1 1
1 0 1 0 1
2 0 0 0 0
'Swim' 'discounts' 'good' result(or relation)
0 0 1 0 1
1 0 0 1 1
2 0 0 1 1
Final results:
'a' 'b'
0 1 1
1 1 1
2 0 1
Upvotes: 1
Views: 60
Reputation: 862571
I believe need str.contains
in loop by dict with word bondaries
by \b
with join by |
for regex OR
:
for k, v in keywords.items():
pat = '|'.join(r"\b{}\b".format(x) for x in v)
#print (pat)
df[k] = df['String'].str.contains(pat).astype(int)
print (df)
String a b
0 New Beauty Company is now offering discounts 1 1
1 Swimming is good for children and adults 1 1
2 Children love food though it may not be good 0 1
If need also columns by each value and create MultiIndex in columns:
df = df.set_index('String')
for k, v in keywords.items():
for x in v:
df[(k, x)] = df.index.str.contains(x).astype(int)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print (df)
a b
New is Swim discounts good
String
New Beauty Company is now offering discounts 1 1 0 1 0
Swimming is good for children and adults 0 1 1 0 1
Children love food though it may not be good 0 0 0 0 1
And then is possible get max
by MultiIndex
:
df = df.max(axis=1, level=0)
print (df)
a b
String
New Beauty Company is now offering discounts 1 1
Swimming is good for children and adults 1 1
Children love food though it may not be good 0 1
Upvotes: 2