identification of duplicates in pandas dataframe

Question

I have the following dataframe 'matches_df', using python 2.7:

name | opponent | date     | win
'Bob'    'Bill'   7/12/16     Y
'Mike'   'Matt'   4/15/18     N
'Tim'    'Tom'    1/1/11      N
'Bill'   'Bob'    7/12/16     N

I want a list not containing duplicate games. These are games that have the same two players (not necessarily in the same columns) and took place on the same date. In the above example, games 1 and 4 are duplicates.

In order to address this, I tried to create a 4th column, game_id, that creates a sorted combination of the first 3 columns. Eg I wanted this result:

name | opponent | date    | win | game_id
'Bob'    'Bill'   7/12/16    Y   '7/12/16 Bill Bob'  
'Mike'   'Matt'   4/15/18    N   '4/15/18 Matt Mike'
'Tim'    'Tom'    1/1/11     N   '1/1/11 Tim Tom'
'Bill'   'Bob'    7/12/16    N   '7/12/16 Bill Bob'

I used the following code:

def sort_and_squash(a,b,c):
    return ''.join(sorted([str(a),str(b),str(c)]))
matches_df = matches_df.assign(game_id = lambda x: sort_and_squash(x.name,x.opponent,x.date))

However, this did not work as intended, creating a blank column in the data frame.

I am looking for help either in finding the error in my code for the intermediate step or recommending an alternative approach.

piRSquared · Accepted Answer

Sort the players in columns ['name', 'opponent']
If they are the same but in different positions, sorting will put them in the same position.

game_id_df = df[['date']].join(df[['name', 'opponent']].apply(sorted, 1))
print(game_id_df)

      date  name opponent
0  7/12/16  Bill      Bob
1  4/15/18  Matt     Mike
2   1/1/11   Tim      Tom
3  7/12/16  Bill      Bob

Then join strings and add column

df['game_id'] = game_id_df.apply(tuple, 1).str.join(' ')
print(df)

   name opponent     date win            game_id
0   Bob     Bill  7/12/16   Y   7/12/16 Bill Bob
1  Mike     Matt  4/15/18   N  4/15/18 Matt Mike
2   Tim      Tom   1/1/11   N     1/1/11 Tim Tom
3  Bill      Bob  7/12/16   N   7/12/16 Bill Bob

Now you can use game_id to drop duplicates

print(df.drop_duplicates(subset=['game_id']))

   name opponent     date win            game_id
0   Bob     Bill  7/12/16   Y   7/12/16 Bill Bob
1  Mike     Matt  4/15/18   N  4/15/18 Matt Mike
2   Tim      Tom   1/1/11   N     1/1/11 Tim Tom

identification of duplicates in pandas dataframe

Answers (2)

Related Questions