Reputation: 247
I am trying to clean up a large dataset. A sample of the dataset shows:
Player Name Headline
0 LeBron James Woj: NBA. ESPN planning for a H-O-R-S-E tourna...
1 LeBron James LeBron James suggests 5-10 games before playoffs
2 LeBron James LeBron James overcomes Pelicans with 34/13/12
3 LeBron James Anthony Davis (knee) won't play on Sunday
... ... ...
105180 Ryan Anderson Ryan Anderson has first career double-double
105181 Ryan Anderson Yi says he'll be back after All-Star break
105182 Ryan Anderson Anderson stays hot in Nets win
The problem I am having is that some headlines do not have anything to do with the player name (as you can see in row 0, 3, and 105181).
Question:
Is there a way to clean up the data so a portion of the "Player Name" column is featured in the "Headline" column? The rest that don't match would be removed from the new dataset made.
For example:
Row 0 would be removed since "LeBron" or "James" isn't in the "Headline" column.
Row 1 and 2 would stay the same.
Row 3 would be removed since "LeBron" or "James" isn't in the "Headline" column.
Row 105180 would stay the same.
Row 105181 would be removed since "Ryan" or "Anderson" isn't in the "Headline" column.
Row 105182 would stay the same since "Anderson" is in the "Player Name" and "Headline" column.
The new dataset will feature only the filtered data that has a portion of the "Player Name" column in the "Headline" column. I didn't know if there was a way to do this or if str.split
should be used on the "Headline" column to split up the name from the rest of the headline then clean up from there.
Upvotes: 1
Views: 174
Reputation: 29635
You could use groupby
as you seem to have several times the same player and then str.contains
on the name of the player once split
to get either part of the name.
df_ = df[df.groupby('Player Name')
.apply(lambda x: x['Headline'].str.contains('|'.join(x.name.split(' '))))
.reset_index(level=0, drop=True)]
print (df_)
Player Name Headline
1 LeBron James LeBron James suggests 5-10 games before playoffs
2 LeBron James LeBron James overcomes Pelicans with 34/13/12
105180 Ryan Anderson Ryan Anderson has first career double-double
105182 Ryan Anderson Anderson stays hot in Nets win
Upvotes: 0
Reputation: 1624
You can write a function and apply it to each row, like this:
def keep_headlines(row):
first = row['Player Name'].split()[0]
last = row['Player Name'].split()[1]
if (first in row['Headline']) | (last in row['Headline']):
return 1
else:
return 0
you can then apply this function using apply method:
df['mask'] = df.apply(keep_headlines, axis=1)
df = df[df['mask'] == 1]
then you can drop the "mask" column.
df.drop(columns='mask', inplace=True)
I hope this answer helps you!
Upvotes: 4
Reputation: 7594
You can try this:
df['exists'] = df.apply(lambda x: any([k in x['Headline'] for k in x['Player Name'].split()]), axis=1)
df = df[~df['exists']==False].drop(columns=['exists'])
print(df)
Output:
Player Name Headline
1 LeBron James LeBron James suggests 5-10 games before playoffs
2 LeBron James LeBron James overcomes Pelicans with 34/13/12
4 Ryan Anderson Ryan Anderson has first career double-double
6 Ryan Anderson Anderson stays hot in Nets win
Upvotes: 2
Reputation: 56965
You can use split
on each cell of the Player Name
column and run an in
check for each of the split words on the corresponding Headline
. If any
pass the in
test, keep the row.
def any_words_included(x):
return any(y in x["Headline"] for y in x["Player Name"].split())
df = df[df.apply(any_words_included, axis=1)]
Provide the kwarg axis=1
to apply
to filter on rows.
Upvotes: 1