Reputation: 2451
I'm dealing with pandas DataFrames in which columns may contain strings representing numbers; the strings may include unexpected non-numeric characters, and the numbers they represent may be of float or int type. For example, a DataFrame may contain something like "$625,452,242.95" for the float 625452242.95 or "53.000.395" for an int equal to 53000395. The DataFrames are read from CSV files and may be quite large.
What is the most efficient way to convert all such strings in a DataFrame to the appropriate numeric types? Thank you.
Upvotes: 2
Views: 7372
Reputation: 35265
Refer to the read_csv documentation.
If all the thousands separators are periods, use thousands='.'.
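A minimal sketch, assuming a hypothetical data.csv whose numeric columns use '.' only as a thousands separator:

import pandas as pd

# '.'-separated thousands (e.g. "53.000.395") are stripped while parsing,
# so the column comes back as a numeric dtype instead of strings.
df = pd.read_csv('data.csv', thousands='.')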
For a column with money, write a function that chops off the $ and converts the remaining string into an integer or a float, and pass it to read_csv via the converters argument. (Again, see the docs.)
I expect any custom converters will be slow (read_csv's parsing is ruthlessly optimized in C), so use built-in features such as the thousands keyword wherever possible.
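A minimal sketch of the converters route, assuming the money column is called 'revenue' in a hypothetical prices.csv (both names are placeholders):

import pandas as pd

def parse_money(value):
    # Drop the leading '$' and the ',' thousands separators, then convert.
    # Assumes values like "$625,452,242.95"; adjust for other formats.
    return float(value.replace('$', '').replace(',', ''))

# converters maps a column name (or index) to a function applied to each raw cell.
df = pd.read_csv('prices.csv', converters={'revenue': parse_money})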
Upvotes: 2
Reputation: 93984
You can also try replacing those symbols and separators:
In [27]: df = pd.DataFrame([['$1,111'], ['$2,222']])

In [28]: df
Out[28]:
        0
0  $1,111
1  $2,222

In [29]: df[0] = df[0].str.replace(r'[$,]', '', regex=True).astype('float')

In [30]: df
Out[30]:
        0
0  1111.0
1  2222.0
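The same replace-then-cast idea covers the '.'-separated integers from the question ("53.000.395"); a small sketch with a placeholder column name:

df2 = pd.DataFrame({'count': ['53.000.395', '1.200']})

# Drop the '.' thousands separators (regex=False treats '.' literally), then cast.
df2['count'] = df2['count'].str.replace('.', '', regex=False).astype('int64')
# df2['count'] now holds 53000395 and 1200 with dtype int64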
Upvotes: 5