Bimons

Reputation: 241

Pandas drop duplicate rows INCLUDING index

I know how to drop duplicate rows based on column data. I also know how to drop duplicate rows based on row index. My question is: is there a way to drop duplicate rows based on the index and one column?

Thanks!

Upvotes: 7

Views: 2363

Answers (2)

L0tad

Reputation: 970

If you don't want to reset and then re-set your index as in JJ101's answer, you can make use of pandas' .duplicated() method instead of .drop_duplicates().

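If you care about duplicates in the index and some column b, you can build the corresponding boolean masks with df.index.duplicated() and df.duplicated(subset="b"), respectively. Keep in mind that the two masks are computed separately, so a row is flagged whenever its index value and its b value have each appeared earlier, even in different rows; on JJ101's sample data this coincides with repeated (index, b) pairs. Combine the masks with the & operator, negate the result with ~, and you get something like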

clean_df = df[~(df.index.duplicated() & df.duplicated(subset="b"))]
print(clean_df)

Output:

   a  b
0  1  2
1  2  2
2  3  3
3  4  4
5  4  5
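When index values and b values can repeat in different rows, one option is to test the (index, b) pairs jointly instead of combining two separate masks. The following is a minimal sketch, not part of the original answer: it assumes the sample df from JJ101's answer, and the frame named tricky is a hypothetical example made up to show where the two approaches differ.

```python
import pandas as pd

# Sample data from JJ101's answer.
df = pd.DataFrame(
    {"a": [1, 2, 2, 3, 4, 4, 5], "b": [2, 2, 2, 3, 4, 5, 5]},
    index=[0, 1, 1, 2, 3, 5, 5],
)

# Pair each index label with its b value, then flag repeated pairs.
pairs = pd.MultiIndex.from_arrays([df.index, df["b"]])
clean_df = df[~pairs.duplicated()]
print(clean_df)

# Hypothetical frame: every (index, b) pair is unique, but index labels
# and b values each repeat across different rows.
tricky = pd.DataFrame({"b": [2, 3, 3, 2]}, index=[0, 1, 0, 1])

# Separate masks combined with &: flags the last two rows.
separate = tricky.index.duplicated() & tricky.duplicated(subset="b").to_numpy()

# Joint check on the pairs: flags nothing.
joint = pd.MultiIndex.from_arrays([tricky.index, tricky["b"]]).duplicated()

print(separate)
print(joint)
```

On the sample df both approaches keep the same five rows; they only diverge on data shaped like tricky.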

Upvotes: 3

JJ101

Reputation: 181

This can be done by turning the index into a column.

Below is a sample data set (fyi, I think someone downvoted your question because it didn't include a sample data set):

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 2, 3, 4, 4, 5], 'b': [2, 2, 2, 3, 4, 5, 5]}, index=[0, 1, 1, 2, 3, 5, 5])

Output:

   a  b
0  1  2
1  2  2
1  2  2
2  3  3
3  4  4
5  4  5
5  5  5

Then you can use the following line. First, reset_index() moves the index into a new column, which is named index here because the original index has no name. You can then drop duplicates based on that column and the other column (b in this case). Afterward, set_index('index') restores the original index values:

df.reset_index().drop_duplicates(subset=['index','b']).set_index('index')

Output:

       a  b
index      
0      1  2
1      2  2
2      3  3
3      4  4
5      4  5
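If the index has a name, reset_index() uses that name for the new column rather than index, so a hard-coded 'index' would no longer match. Below is a minimal sketch of one way around that. The helper drop_index_column_duplicates and the temporary column name _idx are hypothetical names chosen for illustration; neither comes from pandas or from the answer above.

```python
import pandas as pd

def drop_index_column_duplicates(frame, column, tmp="_idx"):
    """Drop rows whose (index, column) pair repeats, keeping the first.

    `tmp` is a temporary column name and is assumed not to clash with
    any existing column in `frame`.
    """
    original_name = frame.index.name
    out = (
        frame.rename_axis(tmp)      # give the index a known name
        .reset_index()              # index becomes the column `tmp`
        .drop_duplicates(subset=[tmp, column])
        .set_index(tmp)             # move it back into the index
    )
    out.index.name = original_name  # restore the original (possibly None) name
    return out

df = pd.DataFrame(
    {"a": [1, 2, 2, 3, 4, 4, 5], "b": [2, 2, 2, 3, 4, 5, 5]},
    index=[0, 1, 1, 2, 3, 5, 5],
)

result = drop_index_column_duplicates(df, "b")
print(result)

# The same call on a frame whose index is named "id".
named_result = drop_index_column_duplicates(df.rename_axis("id"), "b")
print(named_result)
```

The design choice here is to rename the index temporarily rather than guess what reset_index() will call the new column.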

Upvotes: 6
