Debadri Dutta
Debadri Dutta

Reputation: 1193

Return Column with list of Keywords present in String Column - Pandas

I have a list keywords and a dataframe:

keywords=['chair','table', 'fan']

Description

The table is 6 inches long
The fan is really good
The table fan is cheap
The chair is broken
The chair is on the table

I want to search the list of keywords and create a new column with which keyword from the list is present in the Description column.

Description                   Keyword

The table is 6 inches long    table
The fan is really good        fan
The table fan is cheap        table, fan
The chair is broken           chair
The chair is on the table     chair, table

I have searched for a few solutions, but none of them seems to work. I tried the following code on my own:

for i in word_set:
    for x in range(0, len(df)):
        if(df['Event Message'][x] in (i)):
            df['word'] = i

But obviously the time complexity is too high and is taking lot of time. Any help would be appreciated.

Upvotes: 2

Views: 1646

Answers (1)

jezrael
jezrael

Reputation: 862406

Use Series.str.findall with Series.str.join and join values in list by regex or - |:

keywords=['chair','table', 'fan']
df['Keyword'] = df['Description'].str.findall('|'.join(keywords)).apply(set).str.join(', ')

print (df)
                  Description       Keyword
0  The table is 6 inches long         table
1      The fan is really good           fan
2      The table fan is cheap    table, fan
3         The chair is broken         chair
4   The chair is on the table  chair, table

If need words boundaries for avoid extract subtrings:

keywords=['chair','tab', 'fan']
pat = '|'.join(r"\b{}\b".format(x) for x in keywords)

df['Keyword1'] = df['Description'].str.findall(pat).apply(set).str.join(', ')
df['Keyword2'] = df['Description'].str.findall('|'.join(keywords)).apply(set).str.join(', ')
print (df)
                  Description Keyword1    Keyword2
0  The table is 6 inches long                  tab
1      The fan is really good      fan         fan
2      The table fan is cheap      fan    tab, fan
3         The chair is broken    chair       chair
4   The chair is on the table    chair  chair, tab

EDIT: For improve performance is possible use custom function with split and test membership in set:

keywords=['chair','table', 'fan']
s = set(keywords)
f = lambda x: ', '.join(set([y for y in x.split() if y in s]))
df['Keyword1'] = df['Description'].apply(f)

Also list comprehension should be faster:

df['Keyword1'] = [', '.join(set([y for y in x.split() if y in s])) for x in df['Description']]

print (df)
                  Description      Keyword1
0  The table is 6 inches long         table
1      The fan is really good           fan
2      The table fan is cheap    fan, table
3         The chair is broken         chair
4   The chair is on the table  table, chair

Thank you, @Henry Yik for another solution with set.intersection:

df['Keyword1'] = df['Description'].apply(lambda x: ', '.join(set(x.split()).intersection(s)))

Upvotes: 3

Related Questions