Reputation: 251
Here is my setup:
import pandas as pd
import uuid
data = {'col1': ['val1', 'val1', 'val1', 'val2', 'val2', 'val3'],
        'col2': ['val4', 'val4', 'val4', 'val5', 'val5', 'val5']}
df = pd.DataFrame(data)
print (df)
> col1 col2
> 0 val1 val4
> 1 val1 val4
> 2 val1 val4
> 3 val2 val5
> 4 val2 val5
> 5 val3 val5
I want to identify unique (col1, col2) combinations and mark them with a UUID. I tried the following: first, I added a UUID to every record.
# add a UUID to every row
df["UUID"] = df.apply(lambda x: uuid.uuid4(), axis=1)
print (df)
> col1 col2 UUID
> 0 val1 val4 3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1 val1 val4 aa21f1e0-7c68-4778-aaf4-4d18ad05c65a
> 2 val1 val4 ce5dc0d8-4f1a-412f-8f7f-6176bc62038b
> 3 val2 val5 05010229-7ec3-421f-87a1-defc9483c794
> 4 val2 val5 8dc9244d-eec4-42ef-8e9d-015a661d8418
> 5 val3 val5 8950f30a-5eff-457d-b5ed-1d7a94e9b4b0
Then I filtered for the records that are duplicated:
# get duplicated records
mask = df.duplicated(subset=['col1', 'col2'], keep=False)
df_duplicated = df[mask]
print(df_duplicated)
> col1 col2 UUID
> 0 val1 val4 3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1 val1 val4 aa21f1e0-7c68-4778-aaf4-4d18ad05c65a
> 2 val1 val4 ce5dc0d8-4f1a-412f-8f7f-6176bc62038b
> 3 val2 val5 05010229-7ec3-421f-87a1-defc9483c794
> 4 val2 val5 8dc9244d-eec4-42ef-8e9d-015a661d8418
Unfortunately, my imagination fails me at this point. I would like to arrive at the following form (it doesn't matter which UUID connects the duplicates, only that it is the same one for each group):
> col1 col2 UUID
> 0 val1 val4 3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 1 val1 val4 3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 2 val1 val4 3670f6c5-03f7-4f6a-ac3e-1075f6131ca3
> 3 val2 val5 05010229-7ec3-421f-87a1-defc9483c794
> 4 val2 val5 05010229-7ec3-421f-87a1-defc9483c794
> 5 val3 val5 8950f30a-5eff-457d-b5ed-1d7a94e9b4b0
I am grateful for any ideas. Ideally the procedure will be efficient, because I have to apply it to large data sets. :-)
Best P
Upvotes: 2
Views: 194
Reputation: 862581
One idea is to generate a Series of UUIDs with the length of the DataFrame (plus 10% extra) and remove possible duplicates, then map the group labels created by GroupBy.ngroup to it:
s = (pd.Series(uuid.uuid4() for x in range(int(len(df) * 1.1)))
       .drop_duplicates()
       .reset_index(drop=True))
df["UUID"] = df.groupby(['col1', 'col2']).ngroup().map(s)
print (df)
col1 col2 UUID
0 val1 val4 604aedba-9c4e-405c-8d21-fe49749f5ccb
1 val1 val4 604aedba-9c4e-405c-8d21-fe49749f5ccb
2 val1 val4 604aedba-9c4e-405c-8d21-fe49749f5ccb
3 val2 val5 21c202bb-c8db-47bf-9496-04c7794046bc
4 val2 val5 21c202bb-c8db-47bf-9496-04c7794046bc
5 val3 val5 790577c4-eedf-4caa-8f63-e34a5d3b0d8f
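To see why the map works: ngroup assigns one integer label per unique (col1, col2) pair, and that label is used as the positional index into s (shown here with the default sort=True):
# one integer label per unique (col1, col2) pair
print(df.groupby(['col1', 'col2']).ngroup())
0    0
1    0
2    0
3    1
4    1
5    2
dtype: int64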
If the DataFrame is small and removing duplicates is not necessary, the solution can be simplified:
s = pd.Series([uuid.uuid4() for x in range(len(df))])
df["UUID"] = df.groupby(['col1', 'col2']).ngroup().map(s)
Upvotes: 2