Smiley

Reputation: 599

Pandas: ValueError: Integer column has NA values in column 2

I get ValueErrors when trying to read a CSV file into a dataframe with my specified dtypes. I need to be sure that it works and that every line is read in and is correct.

The errors look like this, for example:

Pandas: ValueError: Integer column has NA values in column 2

I am trying to cast a column to integer with the Pandas Python library, but there is a missing value.

However, the CSV file that I read in seems to have some erroneous entries, as it consists of manually entered test results.

I read it in using this command:

test = pd.read_csv(
    "test.csv",
    sep=";",
    names=pandasframe_names,
    dtype=pandas_datatypes,
    skiprows=1,
)

The names are A, B, C, D and E, and pandasframe_names is defined correctly.

If there is an erroneous entry, I need a way of handling this without losing the full row.

So here is my case: I read a CSV table with 5 columns, headed A, B, C, D, E, into a Pandas dataframe. I skip the header row with the parameter skiprows=1.

import numpy as np

pandas_datatypes = {
    "A": np.int64,
    "B": np.int64,
    "C": np.float64,
    "D": object,
    "E": object,
}

Each row has 5 columns: the first 2 are int64, the 3rd is float64, and the last 2 are object (i.e. string).

These match the dtypes I pass when reading the file in, i.e. dtype=pandas_datatypes.

Now I have entries like so:

entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect, as the second value is a float instead of an int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect, as the first value was entered as NA, i.e. it is missing)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect, as the first value was entered as none, i.e. it is missing)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect, as the third value is an int instead of a float)

How do I best handle this, and what do I have to add to make it work reliably? If there is one incorrect entry, I don't want to lose the whole line. Should I insert NULL? But then I would need to flag the row for someone to look at manually.

Upvotes: 23

Views: 32548

Answers (3)

hmadinei

Reputation: 41

There is a difference between int64 and Int64 in Pandas when dealing with NA values. The capitalized dtype, Int64, can handle NA values, whereas the traditional int64 cannot.
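
For illustration, a minimal sketch with a toy Series (not the question's data) showing the difference:

import pandas as pd

s = pd.Series([1, 2, None])  # stored as float64 with a NaN for the gap

# Nullable extension dtype: keeps the integers and represents the gap as <NA>
print(s.astype("Int64"))

# Plain NumPy dtype: cannot hold a missing value, so the cast fails
try:
    s.astype("int64")
except ValueError as err:
    print(err)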

Upvotes: 1

Etienne

Reputation: 1032

Pandas now has extension types, whose integer dtypes support NA values. You will get pd.NA in those fields.

https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes

Use the Pandas Int64 type and you'll be fine!

pandas_datatypes = {'A': 'Int64', 'B': 'Int64', 'C': 'float64', 'D': object, 'E': object}

Just tested it with pandas 1.3.5, works like a charm.
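
For context, here is a minimal sketch of how that mapping could be wired into the read_csv call from the question (file and column names are taken from the question; this assumes missing integers appear as blanks or one of pandas' default NA strings such as NA, while free-text junk like none would still need na_values or the to_numeric approach from the other answer):

import pandas as pd

# 'Int64' (capital I) is the nullable integer dtype; missing values become pd.NA
pandas_datatypes = {"A": "Int64", "B": "Int64", "C": "float64", "D": object, "E": object}

test = pd.read_csv(
    "test.csv",
    sep=";",
    names=["A", "B", "C", "D", "E"],
    dtype=pandas_datatypes,
    skiprows=1,
)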

Upvotes: 30

Josh Friedlander

Reputation: 11657

Since you have incomplete/corrupt data (very common!), you can't enforce dtypes from the outset. You have to first import it as is:

Given file1.csv:

5; 5; 2.2; pedagogy; teacher
8; 7.0; 2.2; pedagogy; teacher
NA; 5; 2.2; pedagogy; teacher
none; 5; 2.2; pedagogy; teacher
8; 5; 2; pedagogy; teacher

We can read it as

df = pd.read_csv('file1.csv', sep=';', header=None, names=['A', 'B', 'C', 'D', 'E'])

We then coerce A, B and C to numeric, forcing NaN wherever there is text.

for col in ['A', 'B', 'C']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

You then have C as a float column. Making A and B into ints is a bit trickier, since they contain NaN values (a known Pandas limitation).

If you have Pandas 0.24 or higher, you can do:

df['A'] = df['A'].astype(pd.Int64Dtype())
df['B'] = df['B'].astype(pd.Int64Dtype())

Otherwise you can either fill the NaNs in some way, or live with the floats.
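
Since the question also asks about flagging bad rows for manual review, one possible follow-up after the coercion step (the needs_review column name is made up for this sketch):

import pandas as pd

df = pd.read_csv('file1.csv', sep=';', header=None, names=['A', 'B', 'C', 'D', 'E'])
for col in ['A', 'B', 'C']:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Mark any row where coercion left a missing numeric value, so that a
# human can inspect it later instead of it being silently dropped.
df['needs_review'] = df[['A', 'B', 'C']].isna().any(axis=1)
print(df[df['needs_review']])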

Upvotes: 5
