Chan
Chan

Reputation: 4291

How to increase the speed of pandas search process?

I would like to search keywords from a dataframe column, called 'string'.

The keywords are contained in a dictionary.

For each key, the value is an array of several keywords.

My concern is that the speed is very low and it takes a lot of time.

Maybe there are many loops involved and df.str.contains cannot be used.

How to speed up the process?

def match(string, keyword):
        m = len(string)
        n = len(keyword)
        idx = string.find(keyword)
        if idx == -1:
            return 0
        if len(re.findall('[a-zA-Z]', string[idx])) > 0:
            if idx > 0:
                if len(re.findall('[a-zA-Z]', string[idx - 1])) > 0:
                    return 0
        if len(re.findall('[a-zA-Z]', string[idx+n-1])) > 0:
            if idx + n < m:
                if len(re.findall('[a-zA-Z]', string[idx + n])) > 0:
                    return 0
        return 1

def match_keyword(df, keyword_dict, name):
        df_new = pd.DataFrame()
        for owner_id, keyword in keyword_dict.items():
            try:
                for index, data in df.iterrows():
                    a = [match(data['string'], word) for word in keyword]
                    t = int(np.sum(a))
                    if t > 0:
                        df_new.loc[index, name+'_'+str(owner_id)] = 1
                    else:
                        df_new.loc[index, name+'_'+str(owner_id)] = 0  
            except:
                df_new[name+'_'+str(owner_id)] = 0
        return df_new.astype(int)

Input:

  String
0 New Beauty Company is now offering 超級discounts
1 Swimming is good for children and adults
2 Children love food though it may not be good

keywords:{'a':['New', 'is', '超級'], 'b':['Swim', 'discounts', 'good']}

Results:

     'New'  'is'  '超級'   result(or relation)
0     1       1      1        1
1     0       1      0        1
2     0       0      0        0

     'Swim'  'discounts'  'good' result(or relation)
0     0          1          0        1
1     0          0          1        1
2     0          0          1        1

Final results:

    'a'    'b'
0    1      1
1    1      1
2    0      1

Upvotes: 1

Views: 60

Answers (1)

jezrael
jezrael

Reputation: 862571

I believe need str.contains in loop by dict with word bondaries by \b with join by | for regex OR:

for k, v in keywords.items():
    pat = '|'.join(r"\b{}\b".format(x) for x in v)
    #print (pat)
    df[k] = df['String'].str.contains(pat).astype(int)

print (df)
                                         String  a  b
0  New Beauty Company is now offering discounts  1  1
1      Swimming is good for children and adults  1  1
2  Children love food though it may not be good  0  1

If need also columns by each value and create MultiIndex in columns:

df = df.set_index('String')
for k, v in keywords.items():
    for x in v:
        df[(k, x)] = df.index.str.contains(x).astype(int)

df.columns = pd.MultiIndex.from_tuples(df.columns)
print (df)
                                               a       b               
                                             New is Swim discounts good
String                                                                 
New Beauty Company is now offering discounts   1  1    0         1    0
Swimming is good for children and adults       0  1    1         0    1
Children love food though it may not be good   0  0    0         0    1

And then is possible get max by MultiIndex:

df = df.max(axis=1, level=0)
print (df)
                                              a  b
String                                            
New Beauty Company is now offering discounts  1  1
Swimming is good for children and adults      1  1
Children love food though it may not be good  0  1

Upvotes: 2

Related Questions