Simd
Simd

Reputation: 21253

How to convert non-numeric entries to NaN in read_csv

I am reading in a file with:

pd.read_csv("file.csv", dtype={'ID_1':float})

The file looks like:

ID_0, ID_1,ID_2
a,002,c
b,004,d
c,   ,e       
n,003,g

Unfortunately read_csv fails complaining it can't convert ' ' to a float.

What is the right way to read in a csv and convert anything that can't be converted to a float into NaN?

Upvotes: 2

Views: 868

Answers (2)

Ma0
Ma0

Reputation: 15204

This is my understanding of reading the documentation:

def my_func(x):
    try:
        converted_value = float(x)
    except ValueError:
        converted_value = 'NaN'
    return converted_value

pd.read_csv("file.csv", dtype={'ID_1':float}, converters={'ID_1':my_func})

(As I am at work now and don't have access to pandas I cannot tell you if it works but it looks as it should (said every programmer ever..))

See also these relevant SO questions:

Upvotes: 4

EdChum
EdChum

Reputation: 394051

If you don't specify the dtype param and pass skipinitialspace=True then it will just work:

In [4]:
t="""ID_0,ID_1,ID_2
a,002,c
b,004,d
c,   ,e
n,003,g"""

pd.read_csv(io.StringIO(t), skipinitialspace=True)
Out[4]:
  ID_0  ID_1 ID_2
0    a   2.0    c
1    b   4.0    d
2    c   NaN    e
3    n   3.0    g

So in your case:

pd.read_csv("file.csv", skipinitialspace=True)

will just work

You can see that the dtypes are as expected:

In [5]:
pd.read_csv(io.StringIO(t), skipinitialspace=True).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
ID_0    4 non-null object
ID_1    3 non-null float64
ID_2    4 non-null object
dtypes: float64(1), object(2)
memory usage: 176.0+ bytes

Upvotes: 5

Related Questions