The Great
The Great

Reputation: 7743

How to retain float precision with character values using pandas?

I have a data frame like as shown below

df = pd.DataFrame({'source_code':['A250.00','C791.0','716.90','493.90','143.21','134.52'],
                   'source_description':['test1', 'test1','test2','test3','test4,'test5'],
                   'key_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})

hash_file = pd.DataFrame({'source_id':['A250','C791','716.9','493.9','143.21','134.52'],
                          'source_code':['test1','test2','test3','test4','test5'],
                          'hash_id':[911,512,713,814,616,717]})
id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

I would like to update the values of the key_id column by comparing the source_code, source_description columns with source_id and source_code columns.

So, I tried the below based on this post

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

While this works fine in normal scenarios, but for specific scenarios when there is a mismatch like 250 and 250.00 or 791.0 and 791 etc, it doesn't work and produces incorrect output like below

enter image description here

So, I tried converting them to strings but it doesn't work still

I expect my output to be like below

enter image description here

Upvotes: 1

Views: 103

Answers (1)

jezrael
jezrael

Reputation: 863351

If possible, convert values to floats:

df['source_code'] = df['source_code'].astype(float)
hash_file['source_id'] = hash_file['source_id'].astype(float)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0       250.00              test1     911
1       791.00              test1     512
2       716.90              test2     713
3       493.90              test3     814
4       143.21              test4     616
5       134.52              test5     717

But there should be problem with float precision, one possible trick is multiple values some scalar like 1000 and then convert to integers:

df['source_code'] = df['source_code'].astype(float).mul(100).astype(int)
hash_file['source_id'] = hash_file['source_id'].astype(float).mul(100).astype(int)

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)


print (df)
   source_code source_description  key_id
0        25000              test1     911
1        79100              test1     512
2        71690              test2     713
3        49390              test3     814
4        14321              test4     616
5        13452              test5     717

EDIT:

If problem is only last 0 or last .0 use:

df['source_code'] = df['source_code'].str.replace('[\.]*[0]+$','', regex=True)
print (df)
  source_code source_description  key_id
0        A250              test1     NaN
1        C791              test1     NaN
2       716.9              test2     NaN
3       493.9              test3     NaN
4      143.21              test4     NaN
5      134.52              test5     NaN

id_file =  hash_file.set_index(['source_id','source_code'])['hash_id']

df['key_id'] = df.set_index(['source_code','source_description']).index.map(id_file)

print (df)
  source_code source_description  key_id
0        A250              test1     911
1        C791              test1     512
2       716.9              test2     713
3       493.9              test3     814
4      143.21              test4     616
5      134.52              test5     717

Better (I hope) regex for remove last .0 if exist:

import re

#https://stackoverflow.com/a/44111202/2901002
rgx = re.compile(r'(?:(\.)|(\.\d*?[1-9]\d*?))0+(?=\b|[^0-9])')
df['source_code'] = df['source_code'].str.replace(rgx, r'\2', regex=True)

Upvotes: 1

Related Questions