Pet

Reputation: 251

How do you find and mark duplicates within a pandas DataFrame?

Here is my setup:

import pandas as pd
import uuid

data = {'col1': ['val1','val1','val1','val2','val2', 'val3'],
        'col2': ['val4','val4','val4','val5','val5', 'val5']
        }
df = pd.DataFrame(data)
print(df)

>    col1  col2
> 0  val1  val4
> 1  val1  val4
> 2  val1  val4
> 3  val2  val5
> 4  val2  val5
> 5  val3  val5

I want to identify duplicate rows and mark them with a UUID. Here is what I tried: first, I added a UUID to every record.

# add a UUID to every row
df["UUID"] = df.apply(lambda x: uuid.uuid4(), axis=1)
print(df)

>    col1  col2                                  UUID
> 0  val1  val4  3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1  val1  val4  aa21f1e0-7c68-4778-aaf4-4d18ad05c65a
> 2  val1  val4  ce5dc0d8-4f1a-412f-8f7f-6176bc62038b
> 3  val2  val5  05010229-7ec3-421f-87a1-defc9483c794
> 4  val2  val5  8dc9244d-eec4-42ef-8e9d-015a661d8418
> 5  val3  val5  8950f30a-5eff-457d-b5ed-1d7a94e9b4b0

Then I filtered for the rows that are duplicated:

# get duplicated records (keep=False marks every member of a duplicate group)
mask = df.duplicated(subset=['col1', 'col2'], keep=False)
df_duplicated = df[mask]
print(df_duplicated)

>    col1  col2                                  UUID
> 0  val1  val4  3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1  val1  val4  aa21f1e0-7c68-4778-aaf4-4d18ad05c65a
> 2  val1  val4  ce5dc0d8-4f1a-412f-8f7f-6176bc62038b
> 3  val2  val5  05010229-7ec3-421f-87a1-defc9483c794
> 4  val2  val5  8dc9244d-eec4-42ef-8e9d-015a661d8418

Unfortunately, this is where my imagination fails me. I would like to end up with the following form (it doesn't matter which UUID connects the duplicates, only that it is the same one):

>    col1  col2                                  UUID
> 0  val1  val4  3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1  val1  val4  3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 2  val1  val4  3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 3  val2  val5  05010229-7ec3-421f-87a1-defc9483c794
> 4  val2  val5  05010229-7ec3-421f-87a1-defc9483c794
> 5  val3  val5  8950f30a-5eff-457d-b5ed-1d7a94e9b4b0

I am grateful for any ideas. Ideally the procedure would be efficient, because I have to apply it to large data sets. :-)

Best P

Upvotes: 2

Views: 194

Answers (1)

jezrael

Reputation: 862581

One idea is to generate a Series of UUIDs as long as the DataFrame (plus 10% headroom), remove any accidental duplicates, and then map the group numbers produced by GroupBy.ngroup to those UUIDs:

# oversize the UUID pool by 10% so dropping duplicates still leaves enough values
s = (pd.Series(uuid.uuid4() for x in range(int(len(df) * 1.1)))
       .drop_duplicates()
       .reset_index(drop=True))
# ngroup() numbers each (col1, col2) group 0..n-1; map those numbers to UUIDs
df["UUID"] = df.groupby(['col1', 'col2']).ngroup().map(s)
print(df)

   col1  col2                                  UUID
0  val1  val4  604aedba-9c4e-405c-8d21-fe49749f5ccb
1  val1  val4  604aedba-9c4e-405c-8d21-fe49749f5ccb
2  val1  val4  604aedba-9c4e-405c-8d21-fe49749f5ccb
3  val2  val5  21c202bb-c8db-47bf-9496-04c7794046bc
4  val2  val5  21c202bb-c8db-47bf-9496-04c7794046bc
5  val3  val5  790577c4-eedf-4caa-8f63-e34a5d3b0d8f
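To see what `ngroup` contributes here: it assigns every distinct `(col1, col2)` combination a consecutive integer, which then serves as the index into the UUID pool. A minimal sketch with the question's data:

```python
import pandas as pd

# the question's data
df = pd.DataFrame({'col1': ['val1', 'val1', 'val1', 'val2', 'val2', 'val3'],
                   'col2': ['val4', 'val4', 'val4', 'val5', 'val5', 'val5']})

# ngroup() numbers each distinct (col1, col2) combination 0, 1, 2, ...
group_ids = df.groupby(['col1', 'col2']).ngroup()
print(group_ids.tolist())  # [0, 0, 0, 1, 1, 2]
```

Rows belonging to the same pair get the same integer, so mapping those integers through any UUID Series gives duplicates the same UUID.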

If the DataFrame is small and removing duplicates is not necessary, the solution can be simplified:

s = pd.Series([uuid.uuid4() for x in range(len(df))])
df["UUID"] = df.groupby(['col1', 'col2']).ngroup().map(s)
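An alternative sketch (not from the answer above) that avoids the oversized UUID pool entirely: generate exactly one UUID per unique pair via `drop_duplicates`, then merge the result back onto the DataFrame. This assumes the same `df` and columns as in the question:

```python
import uuid
import pandas as pd

df = pd.DataFrame({'col1': ['val1', 'val1', 'val1', 'val2', 'val2', 'val3'],
                   'col2': ['val4', 'val4', 'val4', 'val5', 'val5', 'val5']})

# one row per unique (col1, col2) pair, each with its own UUID
keys = df[['col1', 'col2']].drop_duplicates().copy()
keys['UUID'] = [uuid.uuid4() for _ in range(len(keys))]

# merge the UUIDs back so duplicate rows share the same value
df = df.merge(keys, on=['col1', 'col2'], how='left')
```

Here the number of generated UUIDs always equals the number of unique pairs, so no headroom or deduplication of the pool is needed.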

Upvotes: 2
