Reputation: 42946
Normally I anonymize my data by using hashlib and using the .apply(hash) function.
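For reference, the hashing approach might look roughly like the sketch below. The helper name hash_name and the choice of SHA-256 are assumptions for illustration, since the post does not say which hash function is applied.

```python
import hashlib

import pandas as pd


def hash_name(name: str) -> str:
    """Return the SHA-256 hex digest of a name (hypothetical helper)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


df_hashed = pd.DataFrame({'contributor': ['eric', 'frank', 'frank']})
# Identical names give identical digests, so repeated contributors stay linked.
df_hashed['contributor'] = df_hashed['contributor'].apply(hash_name)
```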
Now I'm trying a new approach. Imagine I have the following DataFrame, df:
import pandas as pd

df = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed': [10, 28, 49, 77, 31]})
  contributor  amount payed
0        eric            10
1       frank            28
2        john            49
3       frank            77
4     barbara            31
I want to anonymize it by turning the names into person1, person2, etc., like this:
output = pd.DataFrame({'contributor': ['person1', 'person2', 'person3', 'person2', 'person4'],
                       'amount payed': [10, 28, 49, 77, 31]})
  contributor  amount payed
0     person1            10
1     person2            28
2     person3            49
3     person2            77
4     person4            31
So my first thought was to summarize the name column so that each name is attached to a unique index, which I can then use as the number after 'person'.
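A minimal sketch of that idea, assuming the names should be numbered in order of first appearance. The names mapping and anonymized are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                   'amount payed': [10, 28, 49, 77, 31]})

# Number each distinct name in order of first appearance, starting at 1.
# `mapping` is a hypothetical name used only for this illustration.
mapping = {name: 'person' + str(i)
           for i, name in enumerate(df['contributor'].unique(), start=1)}

# Replace every name with its label; the original df is left untouched.
anonymized = df.assign(contributor=df['contributor'].map(mapping))
```

Keeping mapping around makes it possible to look up who is behind each label later, so it should be stored as carefully as the original names.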
Upvotes: 3
Views: 5459
Reputation: 55
# factorize assigns each distinct name an integer code, starting at 0
labels, uniques = pd.factorize(df['contributor'])
labels = ['person_' + str(l) for l in labels]
df['contributor_anonymized'] = labels
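To make the two return values of pd.factorize concrete, here is a small sketch on the names from the question. The variable names is an assumption introduced only for this example.

```python
import pandas as pd

names = pd.Series(['eric', 'frank', 'john', 'frank', 'barbara'])
labels, uniques = pd.factorize(names)

# labels: one integer code per row, assigned in order of first appearance
print(labels.tolist())   # [0, 1, 2, 1, 3]
# uniques: the distinct values; uniques[code] recovers the original name
print(uniques.tolist())  # ['eric', 'frank', 'john', 'barbara']
```

Because the codes start at 0, the first contributor ends up as person_0 unless 1 is added to the codes.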
Upvotes: 1
Reputation: 863701
I think a faster solution is to use factorize to get a code for each unique value, add 1, convert to a Series, cast to strings, and prepend the Person string:
df['contributor'] = 'Person' + pd.Series(pd.factorize(df['contributor'])[0] + 1).astype(str)
print(df)
  contributor  amount payed
0     Person1            10
1     Person2            28
2     Person3            49
3     Person2            77
4     Person4            31
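One caveat: pd.Series(...) gets a fresh 0..n-1 index, and column assignment aligns on index labels, so the one-liner assumes df still has its default index. If the frame was filtered or re-indexed, a variant that passes index=df.index explicitly may be safer. The frame df2 and its index values below are made up for illustration.

```python
import pandas as pd

# Same data as in the question, but with a non-default index (assumed example).
df2 = pd.DataFrame({'contributor': ['eric', 'frank', 'john', 'frank', 'barbara'],
                    'amount payed': [10, 28, 49, 77, 31]},
                   index=[10, 11, 12, 13, 14])

# Codes start at 0, so add 1 to begin numbering at Person1.
codes = pd.factorize(df2['contributor'])[0] + 1
# Reuse df2's own index so the assignment lines up row by row.
df2['contributor'] = 'Person' + pd.Series(codes, index=df2.index).astype(str)
```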
Upvotes: 8
Reputation: 3
Maybe try creating a DataFrame called "index" for this operation that holds the unique name values?
Then build the masks from the row indexes of those unique names and merge the resulting index DataFrame with the original data.
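# Lookup table: one row per unique contributor
index = pd.DataFrame()
index['contributor'] = df['contributor'].unique()
# Row position + 1 becomes the number after 'person'
index['mask'] = index['contributor'].apply(
    lambda x: 'person' + str(index[index['contributor'] == x].index[0] + 1))
# The merge joins on the shared 'contributor' column
df.merge(index, how='left')[['mask', 'amount payed']]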
Upvotes: 0