Nick
Nick

Reputation: 247

How to take a portion of one column to match with a portion of another column in pandas?

I am trying to clean up a large dataset. A sample of the dataset shows:

         Player Name        Headline    
0        LeBron James       Woj: NBA. ESPN planning for a H-O-R-S-E tourna...
1        LeBron James       LeBron James suggests 5-10 games before playoffs
2        LeBron James       LeBron James overcomes Pelicans with 34/13/12
3        LeBron James       Anthony Davis (knee) won't play on Sunday
...      ...                ...
105180   Ryan Anderson      Ryan Anderson has first career double-double
105181   Ryan Anderson      Yi says he'll be back after All-Star break
105182   Ryan Anderson      Anderson stays hot in Nets win

The problem I am having is that some headlines do not have anything to do with the player name (as you can see in row 0, 3, and 105181).

Question:

Is there a way to clean up the data so a portion of the "Player Name" column is featured in the "Headline" column? The rest that don't match would be removed from the new dataset made.

For example:

Row 0 would be removed since "LeBron" or "James" isn't in the "Headline" column.

Row 1 and 2 would stay the same.

Row 3 would be removed since "LeBron" or "James" isn't in the "Headline" column.

Row 105180 would stay the same.

Row 105181 would be removed since "Ryan" or "Anderson" isn't in the "Headline" column.

Row 105182 would stay the same since "Anderson" is in the "Player Name" and "Headline" column.

The new dataset will feature only the filtered data that has a portion of the "Player Name" column in the "Headline" column. I didn't know if there was a way to do this or if str.split should be used on the "Headline" column to split up the name from the rest of the headline then clean up from there.

Upvotes: 1

Views: 174

Answers (4)

Ben.T
Ben.T

Reputation: 29635

You could use groupby as you seem to have several times the same player and then str.contains on the name of the player once split to get either part of the name.

df_ = df[df.groupby('Player Name')
           .apply(lambda x: x['Headline'].str.contains('|'.join(x.name.split(' '))))
           .reset_index(level=0, drop=True)]
print (df_)
          Player Name                                          Headline

1        LeBron James  LeBron James suggests 5-10 games before playoffs
2        LeBron James     LeBron James overcomes Pelicans with 34/13/12
105180  Ryan Anderson      Ryan Anderson has first career double-double
105182  Ryan Anderson                    Anderson stays hot in Nets win

Upvotes: 0

ashkangh
ashkangh

Reputation: 1624

You can write a function and apply it to each row, like this:

def keep_headlines(row):
    first = row['Player Name'].split()[0]
    last = row['Player Name'].split()[1]
    if (first in row['Headline']) | (last in row['Headline']):
        return 1 
    else:
        return 0

you can then apply this function using apply method:

df['mask'] = df.apply(keep_headlines, axis=1)
df = df[df['mask'] == 1]

then you can drop the "mask" column.

df.drop(columns='mask', inplace=True)

I hope this answer helps you!

Upvotes: 4

NYC Coder
NYC Coder

Reputation: 7594

You can try this:

df['exists'] = df.apply(lambda x: any([k in x['Headline'] for k in x['Player Name'].split()]), axis=1)
df = df[~df['exists']==False].drop(columns=['exists'])
print(df)

Output:

     Player Name                                          Headline
1   LeBron James  LeBron James suggests 5-10 games before playoffs
2   LeBron James     LeBron James overcomes Pelicans with 34/13/12
4  Ryan Anderson      Ryan Anderson has first career double-double
6  Ryan Anderson                    Anderson stays hot in Nets win

Upvotes: 2

ggorlen
ggorlen

Reputation: 56965

You can use split on each cell of the Player Name column and run an in check for each of the split words on the corresponding Headline. If any pass the in test, keep the row.

def any_words_included(x):
    return any(y in x["Headline"] for y in x["Player Name"].split())

df = df[df.apply(any_words_included, axis=1)]

Provide the kwarg axis=1 to apply to filter on rows.

Upvotes: 1

Related Questions