Graham Chapman
Graham Chapman

Reputation: 383

Removing data from a column in pandas

I'm trying to prune some data from my data frame but only the rows where there are duplicates in the "To country" column

My data frame looks like this:

   Year From country To country  Points
0  2016      Albania    Armenia       0
1  2016      Albania    Armenia       2
2  2016      Albania  Australia      12
      Year    From country       To country  Points
2129  2016  United Kingdom  The Netherlands       0
2130  2016  United Kingdom          Ukraine      10
2131  2016  United Kingdom          Ukraine       5

[2132 rows x 4 columns]

I try this on it:

df.drop_duplicates(subset='To country', inplace=True)

And what happens is this:

   Year From country To country  Points
0  2016      Albania    Armenia       0
2  2016      Albania  Australia      12
4  2016      Albania    Austria       0
    Year From country       To country  Points
46  2016      Albania  The Netherlands       0
48  2016      Albania          Ukraine       0
50  2016      Albania   United Kingdom       5

[50 rows x 4 columns]

While this does get rid of the duplicated 'To country' entries, it also removes all the values of the 'From country' column. I must be using the drop_duplicates() wrong, but the pandas documentation isn't helping me understand why its dropping more than I'd expect it to?

Upvotes: 2

Views: 87

Answers (2)

Arya McCarthy
Arya McCarthy

Reputation: 8829

No, this behavior is correct—assuming every team played every other team, it's finding the firsts, and all of those firsts are "From" Albania.

From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:

df.drop_duplicates(subset=['To country', 'From country'], inplace=True)

Upvotes: 3

DYZ
DYZ

Reputation: 57033

The simplest solution is to group by the 'to country' name and take the first (or the last, if you prefer) row from each group:

df.groupby('To country').first().reset_index()
#        To country  Year    From country  Points
#0          Armenia  2016         Albania       0
#1        Australia  2016         Albania      12
#2  The Netherlands  2016  United Kingdom       0
#3          Ukraine  2016  United Kingdom      10

Compared to aryamccarthy's solution, this one gives you more control over which duplicates to keep.

Upvotes: 1

Related Questions