Chan

Reputation: 4301

How to categorize data in pandas using contained keywords

Let df be the dataframe as follows:

      date   text
0  2019-6-7  London is good.             
1  2019-5-8  I am going to Paris.        
2  2019-4-4  Do you want to go to London?
3  2019-3-7  I love Paris!   

I would like to add a column city, which indicates the city contained in text, that is,

       date  text                          city
0  2019-6-7  London is good.               London
1  2019-5-8  I am going to Paris.          Paris 
2  2019-4-4  Do you want to go to London?  London
3  2019-3-7  I love Paris!                 Paris 

How can I do this without using a lambda?

Upvotes: 2

Views: 71

Answers (2)

Mohit Motwani

Reputation: 4792

Adding to @WenYoBen's method: if each text contains only one of Paris or London, str.extract is a better fit, since it returns the first match directly:

regex = '(London|Paris)'
df['city'] = df.text.str.extract(regex)
df

       date         text                        city
0   2019-6-7    London is good.                 London
1   2019-5-8    I am going to Paris.            Paris
2   2019-4-4    Do you want to go to London?    London
3   2019-3-7    I love Paris!                   Paris
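As a small sketch of how str.extract behaves at the edges (the sample rows below are made up for illustration and are not from the question): it keeps only the first match in each text and yields NaN when nothing matches. Passing expand=False returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data: one row with two cities, one with none
df = pd.DataFrame({
    "text": ["London is good.", "From Paris to London.", "I stayed home."],
})

# expand=False returns a Series instead of a one-column DataFrame
df["city"] = df["text"].str.extract(r"(London|Paris)", expand=False)

print(df)
```

Only the first city in the second row is kept, and the third row has no city at all, so rows like these may need separate handling.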

If you want every city from the regex that appears in a text, str.extractall is an option too:

df['city'] = df.text.str.extractall(regex).values
df
          date  text                           city
0    2019-6-7   London is good.                London
1    2019-5-8   I am going to Paris.           Paris
2    2019-4-4   Do you want to go to London?   London
3    2019-3-7   I love Paris!                  Paris

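Note that extractall returns a DataFrame with one row per match, indexed by (original row, match number). Assigning its .values straight to a column only works when every text contains exactly one match; with multiple matches (or none) the lengths no longer line up and the assignment fails.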
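Where a text can mention more than one city, one possible approach is to group the extractall result by the original row and collect the matches into a list. This is a sketch under assumptions: the sample rows and the column name cities are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: the second row mentions two cities
df = pd.DataFrame({
    "date": ["2019-6-7", "2019-5-8"],
    "text": ["London is good.", "From Paris to London."],
})

regex = "(London|Paris)"

# extractall yields one row per match, indexed by (original row, match number);
# column 0 holds the first (and only) capture group
matches = df["text"].str.extractall(regex)[0]

# Collect all matches belonging to the same original row into a list
df["cities"] = matches.groupby(level=0).agg(list)

print(df)
```

Because the assignment aligns on the index, rows without any match end up as NaN instead of raising an error.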

Upvotes: 3

BENY

Reputation: 323236

First make sure you have the list of cities, then use str.findall and take the first match:

df.text.str.findall('London|Paris').str[0]
Out[320]: 
0    London
1     Paris
2    London
3     Paris
Name: text, dtype: object
df['city'] = df.text.str.findall('London|Paris').str[0]
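If the cities live in a Python list rather than a hand-written pattern, the regex can be built from that list. The following is a minimal sketch, assuming a hypothetical cities list and made-up sample rows; re.escape guards against names that contain regex metacharacters such as a dot.

```python
import re
import pandas as pd

# Hypothetical list of city names; "St. Louis" contains a regex metacharacter
cities = ["London", "Paris", "St. Louis"]

# Escape each name so it is matched literally, then join into an alternation
pattern = "|".join(re.escape(c) for c in cities)

df = pd.DataFrame({
    "text": ["London is good.", "I love Paris!", "St. Louis is nice."],
})

# findall returns a list of matches per row; .str[0] picks the first one
df["city"] = df["text"].str.findall(pattern).str[0]

print(df)
```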

Upvotes: 3
