Reputation: 7743
I have a data frame like as shown below
df = pd.DataFrame({'source_code':['A250.00','C791.0','716.90','493.90','143.21','134.52'],
'source_description':['test1', 'test1','test2','test3','test4,'test5'],
'key_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
hash_file = pd.DataFrame({'source_id':['A250','C791','716.9','493.9','143.21','134.52'],
'source_code':['test1','test2','test3','test4','test5'],
'hash_id':[911,512,713,814,616,717]})
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
I would like to update the values of the key_id
column by comparing the source_code
, source_description
columns with source_id
and source_code
columns.
So, I tried the below based on this post
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
While this works fine in normal scenarios, but for specific scenarios when there is a mismatch like 250
and 250.00
or 791.0
and 791
etc, it doesn't work and produces incorrect output like below
So, I tried converting them to strings but it doesn't work still
I expect my output to be like below
Upvotes: 1
Views: 103
Reputation: 863351
If possible, convert values to floats:
df['source_code'] = df['source_code'].astype(float)
hash_file['source_id'] = hash_file['source_id'].astype(float)
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
source_code source_description key_id
0 250.00 test1 911
1 791.00 test1 512
2 716.90 test2 713
3 493.90 test3 814
4 143.21 test4 616
5 134.52 test5 717
But there should be problem with float precision, one possible trick is multiple values some scalar like 1000
and then convert to integers:
df['source_code'] = df['source_code'].astype(float).mul(100).astype(int)
hash_file['source_id'] = hash_file['source_id'].astype(float).mul(100).astype(int)
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
source_code source_description key_id
0 25000 test1 911
1 79100 test1 512
2 71690 test2 713
3 49390 test3 814
4 14321 test4 616
5 13452 test5 717
EDIT:
If problem is only last 0
or last .0
use:
df['source_code'] = df['source_code'].str.replace('[\.]*[0]+$','', regex=True)
print (df)
source_code source_description key_id
0 A250 test1 NaN
1 C791 test1 NaN
2 716.9 test2 NaN
3 493.9 test3 NaN
4 143.21 test4 NaN
5 134.52 test5 NaN
id_file = hash_file.set_index(['source_id','source_code'])['hash_id']
df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)
print (df)
source_code source_description key_id
0 A250 test1 911
1 C791 test1 512
2 716.9 test2 713
3 493.9 test3 814
4 143.21 test4 616
5 134.52 test5 717
Better (I hope) regex for remove last .0
if exist:
import re
#https://stackoverflow.com/a/44111202/2901002
rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')
df['source_code'] = df['source_code'].str.replace(rgx, r'\2', regex=True)
Upvotes: 1