Prof. Falken
Prof. Falken

Reputation: 549

Finding duplicate rows in a Pandas Dataframe then Adding a column in the Dataframe that states if the row is a duplicate

I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will produce a 1 if the row is duplicate and 0 if it is not.

So if I have:

     A|B
1    1|x
2    2|y
3    1|x
4    3|z

I would get:

     A|B|C
1    1|x|1
2    2|y|0
3    1|x|1
4    3|z|0

I tried df['C'] = np.where(df['A']==df['A'], '1', '0') but this just created a column of all 1's in C.

Upvotes: 5

Views: 4945

Answers (1)

jezrael
jezrael

Reputation: 862581

You need Series.duplicated with parameter keep=False for all duplicates first, then cast boolean mask (Trues and Falses) to 1s and 0s by astype by int and if necessary then cast to str:

df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0

If need check duplicates in columns A and B together use DataFrame.duplicated:

df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0

And numpy.where solution:

df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print (df)
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0

Upvotes: 10

Related Questions