Reputation: 2451
I'm dealing with pandas DataFrames in which columns may contain strings representing numbers; the strings may include unexpected non-numeric characters, and the numbers they represent may be of float or int type. For example, a DataFrame may contain something like "$625,452,242.95" for the float 625452242.95 or "53.000.395" for an int equal to 53000395. The DataFrames are read from CSV files and may be quite large.
What is the most efficient way to convert all such strings in a DataFrame to the appropriate numeric types? Thank you.
Upvotes: 2
Views: 7372
Reputation: 35265
Refer to the read_csv documentation.
If all the thousands separators are periods, use thousands='.'.
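A minimal sketch, assuming a hypothetical data.csv whose numeric columns use '.' only as a thousands separator:

import pandas as pd

# '.'-separated thousands (e.g. "53.000.395") are stripped while parsing,
# so the column comes back as a numeric dtype instead of strings.
df = pd.read_csv('data.csv', thousands='.')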
For a column with money, write a function that chops off the $ and converts the remaining string into an integer or a float, and pass it to read_csv via the converters argument. (Again, see the docs.)
I expect any custom converters will be slow (read_csv's parsing is ruthlessly optimized in C), so use built-in features such as the thousands keyword wherever possible.
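A minimal sketch of the converters route, assuming the money column is called 'revenue' in a hypothetical prices.csv (both names are placeholders):

import pandas as pd

def parse_money(value):
    # Drop the leading '$' and the ',' thousands separators, then convert.
    # Assumes values like "$625,452,242.95"; adjust for other formats.
    return float(value.replace('$', '').replace(',', ''))

# converters maps a column name (or index) to a function applied to each raw cell.
df = pd.read_csv('prices.csv', converters={'revenue': parse_money})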
Upvotes: 2
Reputation: 93984
You can also try replacing those symbols and separators:
In [27]: df = pd.DataFrame([['$1,111'], ['$2,222']])

In [28]: df
Out[28]:
        0
0  $1,111
1  $2,222

In [29]: df[0] = df[0].str.replace(r'[$,]', '', regex=True).astype('float')

In [30]: df
Out[30]:
        0
0  1111.0
1  2222.0
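The same replace-then-cast idea covers the '.'-separated integers from the question ("53.000.395"); a small sketch with a placeholder column name:

df2 = pd.DataFrame({'count': ['53.000.395', '1.200']})

# Drop the '.' thousands separators (regex=False treats '.' literally), then cast.
df2['count'] = df2['count'].str.replace('.', '', regex=False).astype('int64')
# df2['count'] now holds 53000395 and 1200 with dtype int64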
Upvotes: 5