sonali
sonali

Reputation: 69

Removing repetitive/duplicate occurance in excel using python

I am trying to remove the repetitive/duplicate Names which is coming under NAME column. I just want to keep the 1st occurrence from the repetitive/duplicate names by using python script.

This is my input excel:

input

And need output like this:

output

Upvotes: 2

Views: 170

Answers (1)

Umar.H
Umar.H

Reputation: 23099

This isn't removing duplicates per say you're just filling duplicate keys in one column as blanks, I would handle this as follows :

by creating a mask where you return a true/false boolean if the row is == the row above.

assuming your dataframe is called df

mask = df['NAME'].ne(df['NAME'].shift())

df.loc[~mask,'NAME'] = ''

explanation :

what we are doing above is the following,

first selecting a single column, or in pandas terminology a series, we then apply a .ne (not equal to) which in effect is !=

lets see this in action.

import pandas as pd
import numpy as np
# create data for dataframe
names = ['Rekha', 'Rekha','Jaya','Jaya','Sushma','Nita','Nita','Nita']
defaults = ['','','c-default','','','c-default','','']
classes = ['forth','third','foruth','fifth','fourth','third','fifth','fourth']

now, lets create a dataframe similar to yours.

df = pd.DataFrame({'NAME' : names,
         'DEFAULT' : defaults,
         'CLASS' : classes,
         'AGE' : [np.random.randint(1,5) for len in names], 
         'GROUP' : [np.random.randint(1,5) for len in names]}) # being lazy with your age and group variables. 

so, if we did df['NAME'].ne('Omar') which is the same as [df['NAME'] != 'Omar'] we would get.

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True

so, with that out of the way, we want to see if the name in row 1 (remember python is a 0 index language so row 1 is actually the 2nd physical row) is .eq to the row above.

we do this by calling [.shift][2] hyperlinked for more info.

what this basically does is shift the rows by its index with a defined variable number, lets call this n.

if we called df['NAME'].shift(1)

0       NaN
1     Rekha
2     Rekha
3      Jaya
4      Jaya
5    Sushma
6      Nita
7      Nita

we can see here that that Rekha has moved down

so putting that all together,

df['NAME'].ne(df['NAME'].shift())
0     True
1    False
2     True
3    False
4     True
5     True
6    False
7    False

we assign this to a self defined variable called mask you could call this whatever you want.

we then use [.loc][2] which lets you access your dataframe by labels or a boolean array, in this instance an array.

however, we only want to access the booleans which are False so we use a ~ which inverts the logic of our array.

    NAME    DEFAULT CLASS   AGE GROUP
1   Rekha       third   1   4
3   Jaya        fifth   1   1
6   Nita        fifth   1   2
7   Nita        fourth  1   4

all we need to do now is change these rows to blanks as your initial requirment, and we are left with.

    NAME    DEFAULT   CLASS AGE GROUP
0   Rekha             forth 2   2
1                     third 1   4
2   Jaya    c-default forth 3   3
3                     fifth 1   1
4   Sushma            fourth3   1
5   Nita    c-default third 4   2
6                     fifth 1   2
7                     fourth1   4

hope that helps!

Upvotes: 3

Related Questions