Reputation: 680
I have a dataframe where there is one column of random letters and numbers, then a column of how many random letters/numbers need to be added to random string in the first column. Like so but my dataframe is 3+ million rows:
id missing
XK39J 4
NI94N 4
9IN3 5
MN83D 4
IUN2 5
I am using the following code to generate the new random sequence:
def id_generator(size, chars=string.ascii_uppercase + string.digits):
return ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(size))
data['new_id'] = data['missing'].apply(lambda x: id_generator(size = x))
data['final_id'] = data['id'] + data['new_id']
However, when I use this I end up getting a couple values in the 'final_id' column that are duplicates. But, I need each value in the 'final_id' column to be unique. Like:
id missing new_id final_id
XK39J 4 NJI4 XK39JNJI4
NI94N 4 BNER NI94NBNER
9IN3 5 ER41J 9IN3ER41J
MN83D 4 9D4S MN83D9D4S
IUN2 5 MNST3 IUN2MNST3
My idea was to store all the ids in a list and then get a new randomly generated sequence if it matches but given that there will be 3million+ ids it won't work since iterating through a row of 3m will take too long. Like:
def id_generator(size, chars=string.ascii_uppercase + string.digits):
val_ls = []
val = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(size))
while val in val_ls:
val = ''.join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(size))
else:
val_ls.append(val)
return val
How can I ensure that there are no repeats?
Upvotes: 0
Views: 191
Reputation: 9308
This is still a brute-force but you could try something like this.
from uuid import uuid4
# First generate final_id for all without caring about duplicates
df['new_id'] = df.missing.transform(lambda x: str(uuid4()).upper().replace('-', '')[:x])
df['final_id'] = df.id + df.new_id
# final_ids that are unique and already good
id_good = df.final_id.unique().tolist()
# Try re-generating final_id until we get no more duplicates
while(len(df[df.final_id.duplicated()]) > 0):
dupe_mask = df.final_id.duplicated()
# Regenerate final_id, store in temp column
df.loc[dupe_mask, 'new_id'] = df.loc[dupe_mask].missing.transform(lambda x: str(uuid4()).upper().replace('-', '')[:x])
df.loc[dupe_mask, 'temp'] = df.loc[dupe_mask].id + df.loc[dupe_mask].new_id
# If the new final_id is not duplicates with currently good final_ids, keep it.
df.loc[dupe_mask & ~df.temp.isin(id_good), 'final_id'] = df.loc[dupe_mask & ~df.temp.isin(id_good), 'temp']
id_good += df.loc[dupe_mask & ~df.temp.isin(id_good), 'final_id'].unique().tolist()
df = df.drop('temp', axis=1)
When I tested with 3M rows, it only had to do 1 loop, however, you might want to add timeout as this could theoretically ran forever.
Upvotes: 1