How to replace values in Pandas column with random numbers per unique values (random categorical)?

I have a df with a column that looks like this:

This column is sensitive data. I want to replace each value with a random number, but every occurrence of the same ID should get the same random number.

For example, I want to mask the data in the column like so:

Note the same IDs have the same value. How do I achieve this? I have thousands of IDs.

Upvotes: 2

Answers (3)

SergFSM

Reputation: 1491

I would suggest something like this (but it will not work properly: the values are drawn independently, so different original IDs can end up with the same new value):

from random import randint

# One random draw per group, broadcast to every row of that group
df['id_rand'] = df.groupby('id')['id'].transform(lambda x: randint(1, 1000))

Output:

    id  id_rand
0   11      833
1   22      577
2   22      577
3  333      101
4   33      723
5  333      101
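
One way to avoid the collisions mentioned above is to draw the numbers without replacement and map them onto the unique IDs. This is only a minimal sketch: the sample DataFrame and the pool `1..1000` are assumptions for illustration, and the pool must contain at least as many numbers as there are unique IDs.

```python
import random

import pandas as pd

# Hypothetical sample data mirroring the example output above.
df = pd.DataFrame({"id": [11, 22, 22, 333, 33, 333]})

unique_ids = df["id"].unique()

# random.sample draws without replacement, so no two IDs share a code.
# The pool 1..1000 is an assumption; it needs at least len(unique_ids) numbers.
codes = random.sample(range(1, 1001), k=len(unique_ids))

# Every row with the same ID receives the same code.
df["id_rand"] = df["id"].map(dict(zip(unique_ids, codes)))
```

Because the codes are sampled rather than generated one at a time, two different IDs cannot receive the same number.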

Upvotes: 1

Thomas

Reputation: 25

My idea:

take the unique values from your column,
shuffle unique values,
create a list of new values for each element (from 0 to the number of unique values),
create dictionary with initial values as dictionary keys and new values as dictionary values,
map values using created dictionary to your column.

from random import shuffle

my_col = 'your_sensitive_col_name' # (int type) 

initial_unique_vals = df[my_col].unique() 
new_values = list(range(0, len(initial_unique_vals)))
shuffle(initial_unique_vals)
dict_init_new_values = dict(zip(initial_unique_vals, new_values))
df[my_col] = df[my_col].map(dict_init_new_values)
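
To check that a masking like this behaves as intended, a small helper can confirm that every original value maps to exactly one new value and that no two original values share one. The function name `is_consistent_mask` and the sample data are hypothetical, a sketch rather than part of the approach itself.

```python
import pandas as pd


def is_consistent_mask(original: pd.Series, masked: pd.Series) -> bool:
    """Return True if the original-to-masked mapping is one-to-one."""
    pairs = pd.DataFrame(
        {"orig": original.to_numpy(), "mask": masked.to_numpy()}
    ).drop_duplicates()
    # After dropping duplicate pairs, each side may appear only once.
    return pairs["orig"].is_unique and pairs["mask"].is_unique


# Hypothetical data: one valid mapping and one where 11 and 22 collide.
original = pd.Series([11, 22, 22, 333, 33, 333])
good = original.map({11: 2, 22: 0, 333: 3, 33: 1})
bad = original.map({11: 2, 22: 2, 333: 3, 33: 1})
```

Here `is_consistent_mask(original, good)` holds, while the colliding mapping in `bad` fails the check.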

Upvotes: 2

mozway

Reputation: 262484

Here are two options: generate either an enumerated categorical (non-random, id2) or a unique random number per original ID (id3). In both cases we can use pandas.factorize (or alternatively unique, or pandas.Categorical).

import numpy as np
import pandas as pd

# enumerated categorical
df['id2'] = pd.factorize(df['id'])[0]

# random categorical
s,ids = pd.factorize(df['id'])
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 1
ids = df['id'].unique()
d = dict(zip(ids, np.random.choice(range(1000), size=len(ids), replace=False)))
df['id3'] = df['id'].map(d)

# alternative 2
df['id3'] = pd.Categorical(df['id'])
new_ids = np.random.choice(range(1000), size=len(df['id3'].cat.categories), replace=False)
df['id3'] = df['id3'].cat.rename_categories(new_ids)

Output:

    id  id2  id3
0   11    0  395
1   22    1  428
2   22    1  428
3  333    2  528
4   33    3  783
5  333    2  528

Upvotes: 1
