Reputation: 5498
I am trying to work out how to use Pandas to add a new column showing the number of occurrences of unique rows and then delete any duplicates. I can get close to this output when not using pandas with:
sort <inputfile | uniq -c
or via excel with a new column showing countif or similar. Has anyone done this in Pandas and would be able to help please?
Upvotes: 0
Views: 216
Reputation: 8613
You can use df.drop_duplicates()
to drop duplicate rows.
In addition, if you want a boolean Series marking which rows are duplicates, call df.duplicated()
.
#!/usr/bin/env python3
# coding: utf-8
import pandas as pd
# define DataFrame using same sample data
d = {'i': [1, 2, 3, 4, 5, 6, 1, 4, 9, 10], 'j': [4, 12, 13, 1, 15, 16, 4, 1, 19, 20]}
df = pd.DataFrame(data=d)
# print sample DataFrame
print(df)
# print DataFrame with dropped duplicate rows
print(df.drop_duplicates())
# print DataFrame containing `True` for each duplicate row, see doc for further options
print(df.duplicated())
Edit (due to comments):
After defining the DataFrame df
, try the following:
df.groupby(['i', 'j']).size()
.groupby()
groups the rows by both columns, and .size()
returns the number of rows in each group, i.e. how often each unique (i, j) pair occurs.
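To get exactly the output the question asks for (a new column with the occurrence count, and duplicates collapsed, like `sort | uniq -c`), one way is to convert the resulting Series back to a DataFrame with `reset_index`. This is a sketch using the same sample data; the column name `'count'` is an arbitrary choice:

```python
import pandas as pd

# same sample data as above
d = {'i': [1, 2, 3, 4, 5, 6, 1, 4, 9, 10],
     'j': [4, 12, 13, 1, 15, 16, 4, 1, 19, 20]}
df = pd.DataFrame(data=d)

# count occurrences of each unique (i, j) row; reset_index turns the
# grouped Series back into a DataFrame with a 'count' column
counts = df.groupby(['i', 'j']).size().reset_index(name='count')
print(counts)
```

Each unique row now appears once, with its number of occurrences alongside: for this data, (1, 4) and (4, 1) each get a count of 2 and every other row a count of 1.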
Upvotes: 1