nipy

Reputation: 5498

New column showing number of occurrences of unique rows

I am trying to work out how to use pandas to add a new column showing how many times each unique row occurs, and then drop the duplicate rows. Without pandas, I can get close to this output with:

sort <inputfile | uniq -c 

or in Excel with a new column using COUNTIF or similar. Has anyone done this in pandas and would be able to help, please?

Upvotes: 0

Views: 216

Answers (1)

albert

Reputation: 8613

You can use df.drop_duplicates() to drop duplicate rows. In addition, if you want to see which rows are duplicates, call df.duplicated(), which returns a boolean Series that is True for every row repeating an earlier one.

#!/usr/bin/env python3
# coding: utf-8

import pandas as pd

# define DataFrame using same sample data
d = {'i': [1, 2, 3, 4, 5, 6, 1, 4, 9, 10], 'j': [4, 12, 13, 1, 15, 16, 4, 1, 19, 20]}
df = pd.DataFrame(data=d)

# print sample DataFrame
print(df)

# print DataFrame with dropped duplicate rows
print(df.drop_duplicates())

# print boolean Series containing `True` for each duplicate row, see doc for further options
print(df.duplicated())

Edit (due to comments):

After defining the DataFrame df, try the following:

df.groupby(['i', 'j']).size()

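.groupby(['i', 'j']) groups the rows by each unique combination of the two columns, and .size() returns the number of rows in each group. The result is a Series indexed by (i, j), which corresponds to the output of sort | uniq -c.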
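To get the counts as an actual new column next to the unique rows, as the question asks, the grouped result can be turned back into a DataFrame. The following is a minimal sketch using the same sample data; the column name 'count' and the variable names counts and with_counts are illustrative choices, not something pandas prescribes.

```python
import pandas as pd

# same sample data as above
d = {'i': [1, 2, 3, 4, 5, 6, 1, 4, 9, 10], 'j': [4, 12, 13, 1, 15, 16, 4, 1, 19, 20]}
df = pd.DataFrame(data=d)

# Variant 1: one row per unique (i, j) pair plus a 'count' column.
# Rows come back sorted by the group keys, not in their original order.
counts = df.groupby(['i', 'j']).size().reset_index(name='count')
print(counts)

# Variant 2: attach the group size to every row first, then drop duplicates.
# This keeps the first occurrence of each row in its original position.
with_counts = df.assign(
    count=df.groupby(['i', 'j'])['i'].transform('size')
).drop_duplicates()
print(with_counts)
```

Variant 1 is the closer match to sort | uniq -c, since it also sorts the rows. Variant 2 is probably the better fit if the original row order matters.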

Upvotes: 1
