Reputation: 3927
I've recently started playing around with pandas to manipulate some data, and I'm now trying to anonymize a few columns after a groupby that finds the unique occurrences of each person.
For example, suppose the following DF:
  First Name Last Name         DOB
0        Bob       One  28/05/1973
1        Bob       One  28/05/1973
2        Ana       Two  28/07/1991
3        Ana       Two  28/07/1991
4        Ana       Two  28/07/1991
5        Jim     Three  07/01/1994
I can easily find the unique persons by First Name, Last Name and DOB using df.groupby(['First Name', 'Last Name', 'DOB']).
However, I'd like to apply a function to every unique combination that transforms those names into a known, anonymized (incremental) version:
  First Name Last Name         DOB
0         F1        L1  28/05/1973
1         F1        L1  28/05/1973
2         F2        L2  28/07/1991
3         F2        L2  28/07/1991
4         F2        L2  28/07/1991
5         F3        L3  07/01/1994
I've tried a few things with the transform and apply functions of the DataFrame groupby, but with no luck so far. How could I achieve this?
Upvotes: 1
Views: 70
Reputation: 18315
ids = (df.groupby(["FirstName", "LastName", "DOB"], sort=False)
         .ngroup().add(1)
         .astype(str))
df["FirstName"] = "F" + ids
df["LastName"] = "L" + ids
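First, identify each (FirstName, LastName, DOB) triple by its group number (ngroup). With sort=False the groups are numbered in order of first appearance, and add(1) makes the numbering start at 1.
Then add the prefixes "F" and "L" to the IDs and assign the results to the appropriate columns. (The column names here have no spaces; with the question's headers, use "First Name" and "Last Name" instead.)
This gives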
>>> df
  FirstName LastName         DOB
0        F1       L1  28/05/1973
1        F1       L1  28/05/1973
2        F2       L2  28/07/1991
3        F2       L2  28/07/1991
4        F2       L2  28/07/1991
5        F3       L3  07/01/1994
where ids was
>>> ids
0 1
1 1
2 2
3 2
4 2
5 3
dtype: object
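For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same ngroup idea. It assumes the column names from the question (with spaces), and the helper name anonymize is only illustrative, not part of pandas:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "First Name": ["Bob", "Bob", "Ana", "Ana", "Ana", "Jim"],
    "Last Name": ["One", "One", "Two", "Two", "Two", "Three"],
    "DOB": ["28/05/1973", "28/05/1973", "28/07/1991",
            "28/07/1991", "28/07/1991", "07/01/1994"],
})


def anonymize(frame, keys=("First Name", "Last Name", "DOB")):
    """Hypothetical helper: replace the name columns with F<n>/L<n> labels."""
    # One integer per unique key combination, numbered from 1
    # in order of first appearance (sort=False).
    ids = frame.groupby(list(keys), sort=False).ngroup().add(1).astype(str)
    # assign returns a new DataFrame, so the original stays untouched.
    return frame.assign(**{"First Name": "F" + ids, "Last Name": "L" + ids})


anon = anonymize(df)
print(anon)
```

Because the number comes from the whole (First Name, Last Name, DOB) combination, two different people who happen to share a first name still get different labels.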
Upvotes: 2
Reputation: 120559
You can use pd.factorize:
>>> df.assign(**{'First Name': pd.factorize(df['First Name'])[0],
...              'Last Name': pd.factorize(df['Last Name'])[0]})
   First Name  Last Name         DOB
0           0          0  28/05/1973
1           0          0  28/05/1973
2           1          1  28/07/1991
3           1          1  28/07/1991
4           1          1  28/07/1991
5           2          2  07/01/1994
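The codes above are 0-based integers. If you want the F1/L1 style from the question, a possible sketch (my own variation on the factorize approach, with the sample data rebuilt from the question) is to shift the codes by one and prepend the prefix:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "First Name": ["Bob", "Bob", "Ana", "Ana", "Ana", "Jim"],
    "Last Name": ["One", "One", "Two", "Two", "Two", "Three"],
    "DOB": ["28/05/1973", "28/05/1973", "28/07/1991",
            "28/07/1991", "28/07/1991", "07/01/1994"],
})

# pd.factorize returns (codes, uniques); codes start at 0
# and follow the order of first appearance.
first_codes, _ = pd.factorize(df["First Name"])
last_codes, _ = pd.factorize(df["Last Name"])

labelled = df.assign(**{
    # Shift to 1-based, convert to str, then add the prefix.
    "First Name": "F" + pd.Series(first_codes + 1, index=df.index).astype(str),
    "Last Name": "L" + pd.Series(last_codes + 1, index=df.index).astype(str),
})
print(labelled)
```

Note that this numbers each column independently: two different people with the same first name would both get the same F label, whereas grouping on the full combination keeps them apart.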
Upvotes: 2