Reputation: 93
I have a file that lists deposit balances as strings. In order to plot these numbers, I'm trying to convert the objects to floats. So I wrote code to remove the $ and to strip spaces before and after the values.
member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.replace('$', '')
member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.strip()
member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].astype(float)
When I run the code, I get an error message that says
ValueError: could not convert string to float:
That's it. Before I added the str.strip, the error message showed me that some values had spaces before and after, so I knew to remove those. But I'm a little confused about what else could be causing it.
I looked at the values of the column after I removed the spaces and $, and everything looked normal. Here's a sample.
Any ideas of what I could check for in the column that may be causing this error?
Upvotes: 2
Views: 1175
Reputation: 4487
You have to delete the commas; Python does not recognize them as part of a numeric format. So, taking the list you gave as possible input:
str_num = ['309.00 ', ' 38.00 ', ' 12,486.00 ', '6,108.00', ' 2,537.00']
you have to do this:
list(map(lambda s: float(s.replace(',', '')), str_num))
which gives you the list of floats:
[309.0, 38.0, 12486.0, 6108.0, 2537.0]
Note: you don't need str.strip(), because float() automatically ignores leading and trailing spaces during the cast.
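As a quick check (a minimal sketch, not part of the original answer), you can see that Python's float() already tolerates surrounding whitespace:

```python
# float() ignores leading/trailing whitespace, so an explicit strip is redundant
values = ['309.00 ', ' 38.00 ']
print([float(v) for v in values])  # [309.0, 38.0]
```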
Following your pipeline, before converting to float, you need to do:
member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.replace(',', '')
Or you can run your entire pipeline on one line of code as follows:
member_clean['TotalDepositBalances'] = member_clean['TotalDepositBalances'].str.replace('$', '').str.replace(',', '').astype(float)
Here you will find tests comparing various methods for performing multiple substitutions in a string. Surprisingly, cascading replace calls (as in your pipeline) turn out to be more efficient than a regex for this type of operation. Give it a read.
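A minimal sketch of such a comparison (the sample data and the names s, chained, regexed are made up for illustration; absolute timings will vary by machine and pandas version):

```python
import timeit
import pandas as pd

s = pd.Series(['$1,234.00 ', ' $56,789.00'] * 5000)

# cascade of literal replacements, as in the pipeline above
chained = lambda: s.str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip()
# single regex doing all three substitutions at once
regexed = lambda: s.str.replace(r'[$,\s]', '', regex=True)

# both approaches should clean to the same strings
assert chained().equals(regexed())

print('chained:', timeit.timeit(chained, number=10))
print('regex  :', timeit.timeit(regexed, number=10))
```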
Upvotes: 3
Reputation: 12078
A useful method for working with large datasets or series is to create a lookup dictionary of corrected values so that duplicate values aren't re-calculated:
import pandas as pd
import numpy as np
import re

def fast_num_conversion(s):
    """
    This is an extremely fast approach to parsing messy numbers to floats.
    For large data, the same values are often repeated. Rather than
    re-parse these, we store all unique values, parse them once, and
    use a lookup to convert all figures.
    (Should be ~10x faster than converting without the lookup dict.)
    Note: the input must be a pandas Series.
    """
    f_convert = lambda x: re.sub(r'[$\-,\| ]', '', x)
    f_float = lambda x: float(x) if x != '' else np.nan
    vals = {curr: f_float(f_convert(curr)) for curr in s.unique()}
    return s.map(vals)

str_num = ['309.00', '38 .00 ', '12, 486.00', '6,108.00', '2,537.00']
print(fast_num_conversion(pd.Series(str_num)))
0 309.0
1 38.0
2 12486.0
3 6108.0
4 2537.0
Upvotes: 0