Reputation: 247

How to take words from parentheses in a column to new columns in pandas?

I have a dataframe like below:

      Player Name       Headline
1     LeBron James      LeBron James suggests 5-10 games before playoff
2     LeBron James      LeBron James (groin) probable for Thursday 
3     LeBron James      LeBron James overcomes Pelicans with 34/13/12
4     LeBron James      LeBron James (groin) plans to play on Tuesday   
5     LeBron James      LeBron James (rest) questionable Tuesday      
6     LeBron James      LeBron James (leg) will start on Saturday   
7     LeBron James      LeBron James (hip) is questionable 
8     Ryan Anderson     Anderson (flu) returns against Cavs on Sunday   
9     Ryan Anderson     Ryan Anderson out with respiratory infection   
10    Ryan Anderson     Anderson (rest) not playing

I want to remove any rows that do not have any (text) in the Headline column. Also, I want to have two new columns labeled Injury/Rest and Location like below.

The new dataframe output I want:

      Player Name       Headline                           Injury/Rest  Location
2     LeBron James      LeBron James (groin) probable...   Injury       groin
4     LeBron James      LeBron James (groin) plans...      Injury       groin
5     LeBron James      LeBron James (rest) questionable.. Rest         rest
6     LeBron James      LeBron James (leg) will...         Injury       leg
7     LeBron James      LeBron James (sore hip) is...      Injury       sore hip
8     Ryan Anderson     Anderson (flu) returns...          Injury       flu
10    Ryan Anderson     Anderson (rest) not...             Rest         rest

As you can see, the rows that didn't have any (text) in the Headline column were removed. The ones that have (text) were then categorized in a new column Injury/Rest and Location like above.

I have done df1 = df[df['Headline'].str.contains("(rest)")] to extract all the (rest) rows from the Headline column. There is over 100k rows so I do not know how to do every injury in the ( ) and add data in the two new columns.

How do I get the output I want to clean up the dataframe?

Upvotes: 0

Answers (3)

Rexovas

Reputation: 478

You can accomplish this like so:

import pandas as pd


def get_injury_rest(value):
    if "(rest)" in value.lower():
        return "Rest"
    elif "(" and ")" in value:
        return "Injury"


df = pd.read_csv("Players.csv")
df.loc[:, "Injury/Rest"] = [get_injury_rest(value) for value in df.loc[:, "Headline"]]
df = df.dropna()
df.loc[:, "Location"] = [value.split("(")[1].split(")")[0] for value in df.loc[:, "Headline"]]

Upvotes: 0

sammywemmy

Reputation: 28709

              #keep only rows that have text bounded within brackets
res = (df.loc[df.Headline.str.contains(r"\(.+\)")]
              #extract text within brackets
       .assign(Location = lambda x: x.Headline.str.extract(r"((?<=[(]).+(?=\)))"),
               Injury_Rest = lambda x: np.where(x.Location.eq("rest"), "Rest","Injury")
              )
      )

res


     Player Name    Headline                                           Location Injury_Rest
2   LeBron James    LeBron James (groin) probable for Thursday          groin   Injury
4   LeBron James    LeBron James (groin) plans to play on Tuesday       groin   Injury
5   LeBron James    LeBron James (rest) questionable Tuesday            rest    Rest
6   LeBron James    LeBron James (leg) will start on Saturday           leg  Injury
7   LeBron James    LeBron James (hip) is questionable                  hip Injury
8   Ryan Anderson   Anderson (flu) returns against Cavs on Sunday       flu Injury
10  Ryan Anderson   Anderson (rest) not playing                         rest    Rest

Upvotes: 1

Quang Hoang

Reputation: 150785

This is what I would do:

df['Location'] = df.Headline.str.extract('\((.*)\)')[0]
df = df[df['Location'].notnull()]
df['Injury/Rest'] = np.where(df['Location'].eq('rest'), 'Rest', 'Injury')

Output:

    Player Name    Headline                                       Location    Injury/Rest
--  -------------  ---------------------------------------------  ----------  -------------
 2  LeBron James   LeBron James (groin) probable for Thursday     groin       Injury
 4  LeBron James   LeBron James (groin) plans to play on Tuesday  groin       Injury
 5  LeBron James   LeBron James (rest) questionable Tuesday       rest        Rest
 6  LeBron James   LeBron James (leg) will start on Saturday      leg         Injury
 7  LeBron James   LeBron James (hip) is questionable             hip         Injury
 8  Ryan Anderson  Anderson (flu) returns against Cavs on Sunday  flu         Injury
10  Ryan Anderson  Anderson (rest) not playing                    rest        Rest

Upvotes: 1

How to take words from parentheses in a column to new columns in pandas?

Answers (3)

Related Questions