Reputation: 64

Assign unique identifier for dataframe rows based on dataframe with preassigned unique identifier

I have dataframe with unique identifier assigned based on three columns i.e., [col2,col3,col3]

Dataframe1:

col1      col2     col3     col4      col5         unique_id
1         abc       bcv      zxc      www.com        8
2         bcd       qwe      rty      www.@com       12
3         klp       oiu      ytr      www.io         15
4         zxc       qwe      rty      www.com        6

After data preprocessing, will import Dataframe_2 with same column values as shown above but without unique_id. Dataframe_2 rows must be assigned with unique identifier based on col2,col3,col4 and by referring to the Dataframe1.

If Dataframe_2 has new row which is not present in Dataframe1, then assign new identifier.

Dataframe_2:

col1      col2     col3     col4      col5         
1         bcd       qwe      rty      www.@com              
2         zxc       qwe      rty      www.com
3         abc       bcv      zxc      www.com 
4         kph       hir      mat      www.com

Expected Dataframe_2:

col1      col2     col3     col4      col5         unique_id        
1         bcd       qwe      rty      www.@com        12     
2         zxc       qwe      rty      www.com         6
3         abc       bcv      zxc      www.com         8 
4         kph       hir      mat      www.com         35

Since Row4 is not present in Dataframe1, a new unique identifier is assigned.

Upvotes: 0

Answers (3)

jezrael

Reputation: 862521

First add column unique_id by DataFrame.merge with left join on parameter is omitted for merge by columns ['col2','col3','col4'] specified in subset. For not matched values are created missing values, so is used Series.isna for test them and np.arange for create new array after maximal value and assign them in DataFrame.loc

df = Dataframe_2.merge(Dataframe_1[['col2','col3','col4', 'unique_id']],
                       how='left')

mask = df['unique_id'].isna()
maximal = Dataframe_1['unique_id'].max() + 1

df.loc[mask, 'unique_id'] = np.arange(maximal, maximal + mask.sum())

df['unique_id'] = df['unique_id'].astype(int)
print (df)
   col1 col2 col3 col4      col5  unique_id
0     1  bcd  qwe  rty  www.@com         12
1     2  zxc  qwe  rty   www.com          6
2     3  abc  bcv  zxc   www.com          8
3     4  kph  hir  mat   www.com         16

Upvotes: 0

Prakash Dahal

Reputation: 4875

import math
import random
import pandas as pd
import numpy as np

df3 = pd.merge(df1,df2, on=['col2','col3','col4'], how='right')

def return_unique_num(df1):
  uniqueIds = list(df1['unique_id'].values)
  unique_num = random.randint(1,len(df1)+1)
  while True:
    if unique_num in uniqueIds:
      unique_num = random.randint(1,len(df1)+1)
    else:
      break
  return unique_num

for i, e in enumerate(df3['unique_id']):
  if math.isnan(e):
    df3.iloc[i, 5] = return_unique_num(df1) #replace nan value with unique integer in df3 unique_id column


df3['unique_id'] = df3['unique_id'].astype(int)

df2['unique_id'] = df3['unique_id']

It will assign unique IDs to df2 based on unique_id of df1

Output

col1      col2     col3     col4      col5         unique_id        
1         bcd       qwe      rty      www.@com        12     
2         zxc       qwe      rty      www.com         6
3         abc       bcv      zxc      www.com         8 
4         kph       hir      mat      www.com         35

Upvotes: 0

Ferris

Reputation: 5601

# assign the old unique_id
df2n = df2.join(df1.set_index(['col2', 'col3', 'col4', 'col5'])[['unique_id']],
         on=['col2', 'col3', 'col4', 'col5'], how='left')

# assign new unique_id with max df1.unique_id + 1
id_max = df1.unique_id.max() + 1
null_num = df2n['unique_id'].isnull().sum()

cond = df2n['unique_id'].isnull()
df2n.loc[cond,'unique_id'] = range(id_max, id_max + null_num)
df2n['unique_id'] = df2n['unique_id'].astype(int)

print(df2n)

      col1 col2 col3 col4      col5  unique_id
    0     1  bcd  qwe  rty  www.@com         12
    1     2  zxc  qwe  rty   www.com          6
    2     3  abc  bcv  zxc   www.com          8
    3     4  kph  hir  mat   www.com         16

Upvotes: 2

Assign unique identifier for dataframe rows based on dataframe with preassigned unique identifier

Answers (3)

Related Questions