RustyShackleford

Reputation: 3677

How to replace values in Pandas column with random numbers per unique values (random categorical)?

I have a df with a column that looks like this:

id   
11    
22
22
333
33
333

This column contains sensitive data. I want to replace each value with a random number, but the same ID must always map to the same random number.

For example, I want to mask the data in the column like so:

id   
123   
987
987
456
00
456

Note the same IDs have the same value. How do I achieve this? I have thousands of IDs.

Upvotes: 2

Views: 2804

Answers (3)

SergFSM

Reputation: 1491

I would suggest something like this (but it will not work properly: the values are generated independently at random, so the same new value can end up assigned to different unique initial values):

from random import randint

df['id_rand'] = df.groupby('id')['id'].transform(lambda x: randint(1,1000))
>>> df
    id  id_rand
0   11      833
1   22      577
2   22      577
3  333      101
4   33      723
5  333      101
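
The collision problem noted above can be avoided by drawing all replacement values at once without replacement, so that no two distinct IDs share a value. A minimal sketch, assuming the sample data from the question and a pool of 1 to 1000; the pool size and the name `mapping` are illustrative choices, not part of the original answer:

```python
import random

import pandas as pd

# Sample data from the question (assumed here for illustration).
df = pd.DataFrame({'id': [11, 22, 22, 333, 33, 333]})

unique_ids = df['id'].unique().tolist()

# random.sample draws without replacement, so every unique ID gets a distinct value.
# The pool must contain at least as many numbers as there are unique IDs.
new_values = random.sample(range(1, 1001), k=len(unique_ids))

# Hypothetical name: lookup table from original ID to masked value.
mapping = dict(zip(unique_ids, new_values))
df['id_rand'] = df['id'].map(mapping)
```

With thousands of IDs the pool would need to be enlarged accordingly, since `random.sample` raises a `ValueError` when asked for more values than the pool holds.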

Upvotes: 1

Thomas

Reputation: 25

My idea:

  1. take the unique values from your column,
  2. shuffle the unique values,
  3. create a list of new values, one for each element (from 0 to the number of unique values),
  4. create a dictionary with the initial values as keys and the new values as values,
  5. map the values in your column using the created dictionary.

from random import shuffle

my_col = 'your_sensitive_col_name' # (int type) 

initial_unique_vals = df[my_col].unique() 
new_values = list(range(0, len(initial_unique_vals)))
shuffle(initial_unique_vals)  # shuffles in place, so the pairing with new_values is random
dict_init_new_values = dict(zip(initial_unique_vals, new_values))
df[my_col] = df[my_col].map(dict_init_new_values)
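
To confirm that a masking like this is consistent, one can check that each original value corresponds to exactly one new value and vice versa. A small sketch, assuming a copy of the original column is kept for the comparison; the helper name `is_consistent_masking` and the sample series are hypothetical:

```python
import pandas as pd


def is_consistent_masking(original: pd.Series, masked: pd.Series) -> bool:
    """Return True if original and masked values are in one-to-one correspondence."""
    pairs = pd.DataFrame(
        {'orig': original.to_numpy(), 'masked': masked.to_numpy()}
    ).drop_duplicates()
    # After dropping duplicate pairs, neither side may repeat.
    return bool(pairs['orig'].is_unique and pairs['masked'].is_unique)


original = pd.Series([11, 22, 22, 333, 33, 333])
good = pd.Series([3, 0, 0, 1, 2, 1])
bad = pd.Series([3, 0, 0, 1, 3, 1])  # 11 and 33 collide on the value 3

print(is_consistent_masking(original, good))  # True
print(is_consistent_masking(original, bad))   # False
```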

Upvotes: 2

mozway

Reputation: 262484

Here are two options: generate either an enumerated categorical (non-random, id2) or a unique random number per original ID (id3). In both cases we can use pandas.factorize (or alternatively unique, or pandas.Categorical).

import numpy as np
import pandas as pd

# enumerated categorical
df['id2'] = pd.factorize(df['id'])[0]

# random categorical
s, ids = pd.factorize(df['id'])
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 1
ids = df['id'].unique()
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 2
df['id3'] = pd.Categorical(df['id'])
new_ids = np.random.choice(range(1000), size=len(df['id3'].cat.categories), replace=False)
df['id3'] = df['id3'].cat.rename_categories(new_ids)

Output:

    id  id2  id3
0   11    0  395
1   22    1  428
2   22    1  428
3  333    2  528
4   33    3  783
5  333    2  528

Upvotes: 1
