Reputation: 247
I have a dataframe like below:
Player Name Headline
1 LeBron James LeBron James suggests 5-10 games before playoff
2 LeBron James LeBron James (groin) probable for Thursday
3 LeBron James LeBron James overcomes Pelicans with 34/13/12
4 LeBron James LeBron James (groin) plans to play on Tuesday
5 LeBron James LeBron James (rest) questionable Tuesday
6 LeBron James LeBron James (leg) will start on Saturday
7 LeBron James LeBron James (hip) is questionable
8 Ryan Anderson Anderson (flu) returns against Cavs on Sunday
9 Ryan Anderson Ryan Anderson out with respiratory infection
10 Ryan Anderson Anderson (rest) not playing
I want to remove any rows that do not have any (text)
in the Headline column. Also, I want to have two new columns labeled Injury/Rest
and Location
like below.
The new dataframe output I want:
Player Name Headline Injury/Rest Location
2 LeBron James LeBron James (groin) probable... Injury groin
4 LeBron James LeBron James (groin) plans... Injury groin
5 LeBron James LeBron James (rest) questionable.. Rest rest
6 LeBron James LeBron James (leg) will... Injury leg
7 LeBron James LeBron James (sore hip) is... Injury sore hip
8 Ryan Anderson Anderson (flu) returns... Injury flu
10 Ryan Anderson Anderson (rest) not... Rest rest
As you can see, the rows that didn't have any (text)
in the Headline column were removed. The ones that have (text)
were then categorized in a new column Injury/Rest
and Location
like above.
I have done df1 = df[df['Headline'].str.contains("(rest)")]
to extract all the (rest)
rows from the Headline column. There is over 100k rows so I do not know how to do every injury in the ( )
and add data in the two new columns.
How do I get the output I want to clean up the dataframe?
Upvotes: 0
Views: 49
Reputation: 478
You can accomplish this like so:
import pandas as pd
def get_injury_rest(value):
if "(rest)" in value.lower():
return "Rest"
elif "(" and ")" in value:
return "Injury"
df = pd.read_csv("Players.csv")
df.loc[:, "Injury/Rest"] = [get_injury_rest(value) for value in df.loc[:, "Headline"]]
df = df.dropna()
df.loc[:, "Location"] = [value.split("(")[1].split(")")[0] for value in df.loc[:, "Headline"]]
Upvotes: 0
Reputation: 28709
#keep only rows that have text bounded within brackets
res = (df.loc[df.Headline.str.contains(r"\(.+\)")]
#extract text within brackets
.assign(Location = lambda x: x.Headline.str.extract(r"((?<=[(]).+(?=\)))"),
Injury_Rest = lambda x: np.where(x.Location.eq("rest"), "Rest","Injury")
)
)
res
Player Name Headline Location Injury_Rest
2 LeBron James LeBron James (groin) probable for Thursday groin Injury
4 LeBron James LeBron James (groin) plans to play on Tuesday groin Injury
5 LeBron James LeBron James (rest) questionable Tuesday rest Rest
6 LeBron James LeBron James (leg) will start on Saturday leg Injury
7 LeBron James LeBron James (hip) is questionable hip Injury
8 Ryan Anderson Anderson (flu) returns against Cavs on Sunday flu Injury
10 Ryan Anderson Anderson (rest) not playing rest Rest
Upvotes: 1
Reputation: 150785
This is what I would do:
df['Location'] = df.Headline.str.extract('\((.*)\)')[0]
df = df[df['Location'].notnull()]
df['Injury/Rest'] = np.where(df['Location'].eq('rest'), 'Rest', 'Injury')
Output:
Player Name Headline Location Injury/Rest
-- ------------- --------------------------------------------- ---------- -------------
2 LeBron James LeBron James (groin) probable for Thursday groin Injury
4 LeBron James LeBron James (groin) plans to play on Tuesday groin Injury
5 LeBron James LeBron James (rest) questionable Tuesday rest Rest
6 LeBron James LeBron James (leg) will start on Saturday leg Injury
7 LeBron James LeBron James (hip) is questionable hip Injury
8 Ryan Anderson Anderson (flu) returns against Cavs on Sunday flu Injury
10 Ryan Anderson Anderson (rest) not playing rest Rest
Upvotes: 1