Reputation: 599
I get value errors when trying to read in the CSV file to my datatype. I need to be sure that it works and that every line is read in and is correct.
Errors are for example:
Pandas: ValueError: Integer column has NA values in column 2
I am trying to cast to integer in Pandas Python library, but there is a missing value.
However, the CSV file that I read in seems to have some erroneous entries, as it consists of manually entered test results.
I read in using this command:
test = pd.read_csv(
"test.csv",
sep=";",
names=pandasframe_names,
dtype=pandasframe_datatypes,
skiprows=1,
)
names is A, B, C, D and E and is defined correctly.
If there is an erroneous entry, I need a way of handling this without losing the full row.
So here is my case:
I have a Pandas dataframe that reads in a CSV table with 5 columns with the
headers A, B, C, D, E. I skip row one with the parameter skiprows=1
pandas_datatypes = {
"A": pd.np.int64,
"B": pd.np.int64,
"C": pd.np.float64,
"D": object,
"E": object,
}
My row has 5 column and the first 2 are int64 and the 3rd is float64 and the next 2 are object (e.g. string).
Those are equivalent to my dtype when I read it in. Meaning dtype=pandas_datatypes
Now I have entries like so:
entry 1: 5; 5; 2.2; pedagogy; teacher (correct)
entry 2: 8; 7.0; 2.2; pedagogy; teacher (incorrect, as second is float instead of int)
entry 3: NA; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered NA as is missing)
entry 4: none; 5; 2.2; pedagogy; teacher (incorrect, as first value has entered none as is missing)
entry 5: 8; 5; 2; pedagogy; teacher (incorrect, as third is int instead of float)
How do I best handle this and what do I have to add to make this work for sure? In case that there is one incorrect entry, I don't want to lose the full line. Should I enter NULL? But then I would need to flag this for someone to manually look at it.
Upvotes: 23
Views: 32548
Reputation: 41
There is a difference between int64
and Int64
in Pandas when dealing with NA values. You should consider that the capitalized dtype version, e.g., Int64
, can handle NA values compared to the traditional int64
.
Upvotes: 1
Reputation: 1032
Pandas now has extension types, for which integer support NA values. You will get pd.NA in those fields.
https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
Use Pandas Int64 type, you'll be fine!
pandas_datatypes={'A': 'Int64', 'B': 'Int64', 'C':pd.np.float64, 'D':object, 'E':object}
Just tested it with pandas 1.3.5, works like a charm.
Upvotes: 30
Reputation: 11657
Since you have incomplete/corrupt data (very common!), you can't enforce dtypes from the outset. You have to first import it as is:
Given file1.csv
:
5; 5; 2.2; pedagogy; teacher
8; 7.0; 2.2; pedagogy; teacher
NA; 5; 2.2; pedagogy; teacher
none; 5; 2.2; pedagogy; teacher
8; 5; 2; pedagogy; teacher
We can read it as
df = pd.read_csv('file1.csv', sep=';', header=None, names=['A', 'B', 'C', 'D', 'E'])
We then coerce A, B and C to numeric, forcing NaN
wherever there is text.
for col in ['A', 'B', 'C']:
df[col] = pd.to_numeric(df[col], errors='coerce')
You then have the C as a float col. Making A and B into ints is a bit trickier, since they have NaN
values (a known Pandas issue).
If you have 0.24 or higher, you can do:
df['A'] = df['A'].astype(pd.Int64Dtype())
df['B'] = df['B'].astype(pd.Int64Dtype())
Otherwise you can either fill the NaN
s in some way, or live with the floats.
Upvotes: 5