Reputation: 1749
I have the following dataframe 'matches_df', using python 2.7:
name | opponent | date | win
'Bob' 'Bill' 7/12/16 Y
'Mike' 'Matt' 4/15/18 N
'Tim' 'Tom' 1/1/11 N
'Bill' 'Bob' 7/12/16 N
I want a list not containing duplicate games. These are games that have the same two players (not necessarily in the same columns) and took place on the same date. In the above example, games 1 and 4 are duplicates.
In order to address this, I tried to create a 4th column, game_id, that creates a sorted combination of the first 3 columns. Eg I wanted this result:
name | opponent | date | win | game_id
'Bob' 'Bill' 7/12/16 Y '7/12/16 Bill Bob'
'Mike' 'Matt' 4/15/18 N '4/15/18 Matt Mike'
'Tim' 'Tom' 1/1/11 N '1/1/11 Tim Tom'
'Bill' 'Bob' 7/12/16 N '7/12/16 Bill Bob'
I used the following code:
def sort_and_squash(a,b,c):
return ''.join(sorted([str(a),str(b),str(c)]))
matches_df = matches_df.assign(game_id = lambda x: sort_and_squash(x.name,x.opponent,x.date))
However, this did not work as intended, creating a blank column in the data frame.
I am looking for help either in finding the error in my code for the intermediate step or recommending an alternative approach.
Upvotes: 2
Views: 209
Reputation: 5808
Although this has already been answered by piRSquared, if you'd rather have something closer to your original approach (or you want to understand why your original approach did not work), you can try this.
def sort_and_squash(df):
return [' '.join(sorted([d.strftime('%m/%d/%Y'), n, o]))
for d, n, o in zip(df.date, df.name, df.opponent)]
matches_df = matches_df.assign(game_id=sort_and_squash)
The function that is passed to the assign
method expects as an argument the dataframe and is expected to return the whole new column. You need something like the list comprehension above, to make that work.
>>> print matches_df
date name opponent win game_id
0 2016-07-12 Bob Bill Y 07/12/2016 Bill Bob
1 2018-04-15 Mike Matt N 04/15/2018 Matt Mike
2 2011-01-01 Tim Tom N 01/01/2011 Tim Tom
3 2016-07-12 Bill Bob N 07/12/2016 Bill Bob
Of course, this just inserts the game_id
column; it does not eliminate duplicates. To eliminate them, you also need:
matches_df = matches_df.drop_duplicates(subset=['game_id'])
and then:
>>> print matches_df
date name opponent win game_id
0 2016-07-12 Bob Bill Y 07/12/2016 Bill Bob
1 2018-04-15 Mike Matt N 04/15/2018 Matt Mike
2 2011-01-01 Tim Tom N 01/01/2011 Tim Tom
As last touch, if you don't need the game_id
column any more, you can drop it with:
matches_df = matches_df.drop('game_id', 1)
which gives you:
>>> print matches_df
date name opponent win
0 2016-07-12 Bob Bill Y
1 2018-04-15 Mike Matt N
2 2011-01-01 Tim Tom N
Upvotes: 1
Reputation: 294198
Sort the players in columns ['name', 'opponent']
If they are the same but in different positions, sorting will put them in the same position.
game_id_df = df[['date']].join(df[['name', 'opponent']].apply(sorted, 1))
print(game_id_df)
date name opponent
0 7/12/16 Bill Bob
1 4/15/18 Matt Mike
2 1/1/11 Tim Tom
3 7/12/16 Bill Bob
Then join
strings and add column
df['game_id'] = game_id_df.apply(tuple, 1).str.join(' ')
print(df)
name opponent date win game_id
0 Bob Bill 7/12/16 Y 7/12/16 Bill Bob
1 Mike Matt 4/15/18 N 4/15/18 Matt Mike
2 Tim Tom 1/1/11 N 1/1/11 Tim Tom
3 Bill Bob 7/12/16 N 7/12/16 Bill Bob
Now you can use game_id
to drop duplicates
print(df.drop_duplicates(subset=['game_id']))
name opponent date win game_id
0 Bob Bill 7/12/16 Y 7/12/16 Bill Bob
1 Mike Matt 4/15/18 N 4/15/18 Matt Mike
2 Tim Tom 1/1/11 N 1/1/11 Tim Tom
Upvotes: 3