Roy

Reputation: 1044

Extract Words As A List After A Specific Word

I've got this DataFrame, which is a description of sports and their leagues:

df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

df.text.iloc[0]
===============
The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL

I need to extract all the sport names (the words that follow sport:) and store them as a list in a new column, sports. I'm trying the following code:

pat = r'sport:\W+(?P<sports>(?:\w+))'
new = df.text.str.extract(pat, expand=True)
df.assign(**new)
===============
                        text                              sports
0   The following leagues were identified: sport: ...   basketball

However, it returns only the first occurrence of the sport as a standalone string, whereas I need all the sports as a list.

Desired output:

               text                               sports
0   The following leagues were ...  [basketball, soccer, football]

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

Upvotes: 2

Views: 89

Answers (1)

Wiktor Stribiżew

Reputation: 627609

You can use Series.str.findall:

df['sports'] = df['text'].str.findall(r'sport:\W*(\w+)')

Pandas test:

import pandas as pd
df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

Output:

>>> df['text'].str.findall(r'sport:\W*(\w+)')
0    [basketball, soccer, football]
Name: text, dtype: object
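
For comparison, here is a minimal sketch that puts the question's str.extract call next to str.findall on the same sample row. It also shows str.extractall as one possible alternative; that variant and the names first_only and via_extractall are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

# Raw string so the regex escapes \W and \w reach the regex engine unchanged
pat = r'sport:\W*(\w+)'

# str.extract keeps only the first match in each row
first_only = df['text'].str.extract(pat, expand=False)

# str.findall returns every match in each row as a list
df['sports'] = df['text'].str.findall(pat)

# Alternative (assumption): str.extractall yields one row per match,
# indexed by (original row, match number); group back into lists per row
via_extractall = df['text'].str.extractall(pat)[0].groupby(level=0).agg(list)

print(first_only.iloc[0])
print(df['sports'].iloc[0])
print(via_extractall.iloc[0])
```

Note that str.extractall drops rows that have no match at all, whereas str.findall returns an empty list for them, so findall is usually the simpler choice when the result goes straight into a new column.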

Upvotes: 4
