Florian

Reputation: 311

Pandas: clean & convert DataFrame to numbers

I have a dataframe containing strings, as read from a sloppy csv:

id  Total           B                  C        ...                                        
0   56 974          20 739             34 482   
1   29 479          10 253             16 704   
2   86 961          29 837             43 593   
3   52 687          22 921             28 299   
4   23 794           7 646             15 600   

What I want to do: convert every cell in the frame into a number. The conversion should ignore whitespace, but put NaN where a cell contains something that cannot be parsed as a number. I could probably do this with slow manual looping and value replacement, but I was wondering if there's a nice, clean way to do it.

Upvotes: 1

Views: 104

Answers (1)

jezrael

Reputation: 862611

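You can use read_csv with the regex separator \s{2,} (two or more whitespace characters) and the parameter thousands=' ', so that the single space inside each number is treated as a thousands separator: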

import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions

temp=u"""id  Total           B                  C                                           
0   56 974          20 739             34 482   
1   29 479          10 253             16 704   
2   86 961          29 837             43 593   
3   52 687          22 921             28 299   
4   23 794           7 646             15 600   """
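# after testing, replace 'StringIO(temp)' with 'filename.csv'
# raw string avoids an invalid-escape warning for \s; a regex separator requires engine='python'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')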

print (df)
   id  Total      B      C
0   0  56974  20739  34482
1   1  29479  10253  16704
2   2  86961  29837  43593
3   3  52687  22921  28299
4   4  23794   7646  15600

print (df.dtypes)
id       int64
Total    int64
B        int64
C        int64
dtype: object

Then, if necessary, apply the function to_numeric with the parameter errors='coerce', which replaces non-numeric values with NaN:

df = df.apply(pd.to_numeric, errors='coerce')
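
If the frame has already been loaded as strings (as in the question), a similar result can probably be reached without re-reading the file. The following is only a minimal sketch under assumptions: the sample frame and the "n/a" cell are made up for illustration, and it assumes every column holds strings. It strips all whitespace inside each cell and then coerces whatever is still not numeric to NaN:

```python
import pandas as pd

# Hypothetical frame of strings, mimicking the question's data plus one bad cell
df = pd.DataFrame({
    "Total": ["56 974", "29 479", "86 961"],
    "B": ["20 739", "n/a", "29 837"],
})

def to_number(col):
    # Remove every whitespace character, then turn unparseable cells into NaN
    stripped = col.str.replace(r"\s+", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

cleaned = df.apply(to_number)
print(cleaned)
```

A column that contains a NaN ends up as float64, since NaN cannot be stored in a plain int64 column.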

Upvotes: 2
