Removing data from a column in pandas

Question

I'm trying to prune some data from my data frame, but only the rows where there are duplicates in the "To country" column.

My data frame looks like this:

   Year From country To country  Points
0  2016      Albania    Armenia       0
1  2016      Albania    Armenia       2
2  2016      Albania  Australia      12
      Year    From country       To country  Points
2129  2016  United Kingdom  The Netherlands       0
2130  2016  United Kingdom          Ukraine      10
2131  2016  United Kingdom          Ukraine       5

[2132 rows x 4 columns]

I try this on it:

df.drop_duplicates(subset='To country', inplace=True)

And what happens is this:

   Year From country To country  Points
0  2016      Albania    Armenia       0
2  2016      Albania  Australia      12
4  2016      Albania    Austria       0
    Year From country       To country  Points
46  2016      Albania  The Netherlands       0
48  2016      Albania          Ukraine       0
50  2016      Albania   United Kingdom       5

[50 rows x 4 columns]

While this does get rid of the duplicated 'To country' entries, it also removes all the values of the 'From country' column. I must be using drop_duplicates() wrong, but the pandas documentation isn't helping me understand why it's dropping more than I'd expect.

Arya McCarthy · Accepted Answer

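No, this behavior is correct. With subset='To country', drop_duplicates() keeps only the first row for each distinct 'To country' value. Assuming every country is paired with every other country, all of those first rows are "From" Albania, because Albania's rows come first in the frame.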
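To see why only Albania survives, here is a minimal sketch on a made-up five-row frame (the rows and the names mask and first_per_target are illustrative, not taken from the question's data). It uses DataFrame.duplicated() to show which rows count as repeats when only 'To country' is compared:

```python
import pandas as pd

# Hypothetical miniature of the question's data: two "From" countries,
# each giving points to the same two "To" countries.
df = pd.DataFrame({
    "From country": ["Albania", "Albania", "Albania", "Ukraine", "Ukraine"],
    "To country": ["Armenia", "Armenia", "Australia", "Armenia", "Australia"],
    "Points": [0, 2, 12, 7, 3],
})

# duplicated() flags every row whose 'To country' value has already
# appeared in an earlier row; the first occurrence is not flagged.
mask = df.duplicated(subset="To country")

# Dropping the flagged rows is what drop_duplicates(subset='To country') does.
first_per_target = df[~mask]
print(first_per_target)
```

Both of Ukraine's rows are flagged, because Armenia and Australia already appeared in Albania's rows, so every surviving row is "From" Albania.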

From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:

df.drop_duplicates(subset=['To country', 'From country'], inplace=True)
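As a rough check of that call, the sketch below rebuilds a few rows shaped like the ones in the question (the frame is a hand-typed sample, and deduped is just an illustrative name). It returns a new frame instead of using inplace=True so the before and after can be compared:

```python
import pandas as pd

# Hand-typed sample modelled on the rows shown in the question.
df = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016, 2016],
    "From country": ["Albania", "Albania", "Albania",
                     "United Kingdom", "United Kingdom"],
    "To country": ["Armenia", "Armenia", "Australia", "Ukraine", "Ukraine"],
    "Points": [0, 2, 12, 10, 5],
})

# A row is a duplicate only if BOTH columns match an earlier row.
deduped = df.drop_duplicates(subset=["To country", "From country"])
print(deduped)
```

Each (From, To) pair now appears once, and the United Kingdom rows are no longer lost. By default the first row of each pair is kept; passing keep='last' would keep the last one instead, if that is the Points value you want.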
