Reputation: 11660
I have a csv that looks like this (headers = first row):
name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
When I run:
df = pd.read_csv('file.csv')
Columns a and b have a .0 attached to the end, like so:
df.head()
name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01
Columns a and b are integers or blanks, so why does pd.read_csv() treat them like floats, and how do I ensure they are integers on the read?
Upvotes: 34
Views: 62640
Reputation: 61
Pandas will convert an integer column to float if empty strings exist within it. Empty strings are interpreted as null values upon import, and the conversion of int to float happens because the default missing data indicator is np.nan, which is a float underneath.
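A quick illustration of that coercion (values borrowed from the question):
import pandas as pd
import numpy as np

# A single missing value forces the whole column to float,
# because np.nan is itself a float:
pd.Series([300311, 300713, np.nan]).dtype   # float64

# The nullable Int64 dtype keeps the integers and stores pd.NA instead:
pd.Series([300311, 300713, None], dtype="Int64")
# 0    300311
# 1    300713
# 2      <NA>
# dtype: Int64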
You do not have to drop your null values to deal with this issue!
Pandas recently introduced a new missing data indicator: pd.NA. pd.NA does not have an innate data type and works a lot better than np.nan. However, it should be used within nullable data types - another addition to pandas which allows handling of missing data without altering the dtype of your column.
Considering the above, you have at least two options:
Nullable data types are still experimental, so you will have to convert manually, either with df = df.convert_dtypes() or by performing the operation on one column at a time:
df[col].astype("Int64")
- note the capital I
df[col].astype("Float64")
- note the capital F
df[col].astype(pd.BooleanDtype())
df[col].astype(pd.StringDtype())
or df[col].astype("string")
df = pd.read_csv("data.csv", dtype="string")
This gives all columns the nullable string type and prevents null values from taking np.nan
in the first place. You may then proceed to assign the desired data type yourself.
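A rough sketch of that second option, reusing the question's data plus a blank row (the test row is added for illustration):
import pandas as pd
from io import StringIO

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

# Every column is read as the nullable string dtype; blanks become pd.NA,
# so nothing is coerced to float:
df = pd.read_csv(StringIO(data), dtype="string")

# Then assign the types you actually want, column by column:
df["a"] = df["a"].astype("Int64")
df["b"] = df["b"].astype("Int64")
print(df.dtypes)  # a and b are Int64, the remaining columns are string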
Upvotes: 1
Reputation: 559
I use pandas 1.5.3, and the code below keeps the integer values even though there are NaN values in the column.
df = pd.read_csv("file.csv", dtype=object)
Upvotes: 4
Reputation: 1258
I use this in a bit of Django code to render formatted dataframes into HTML. I don't have prior knowledge of column types or the number of columns, and there's a high likelihood of NaNs in any column. I want to set precision on the true floats and leave the integer values as they are.
1. Set NaNs to an integer value that will never be used (assumed)
2. Use convert_dtypes() to infer the types again without NaNs
3. Convert the dummy value back to NaN (or just use Python None for this purpose)
df.fillna(-999999, inplace=True)
df = df.convert_dtypes()
df = df.replace(-999999, None)
Now the integer columns that were inferred as floats because of missing values are correctly identified as integers. This works on the assumption that there won't be any legitimate values of -999999 in the dataset - a safe assumption for this usage.
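A condensed sketch of that round trip on toy data (the -999999 sentinel is the answer's own choice; the replace(-999999, None) step assumes a pandas version where an explicitly passed None value is treated as missing):
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [300311.0, np.nan], "x": [1.5, 2.5]})
print(df.dtypes.tolist())  # [dtype('float64'), dtype('float64')]

df.fillna(-999999, inplace=True)
df = df.convert_dtypes()        # 'a' is inferred as Int64, 'x' as Float64
df = df.replace(-999999, None)  # turn the sentinel back into <NA>
print(df.dtypes.tolist())  # [Int64Dtype(), Float64Dtype()]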
Note that convert_dtypes() changes float64 to Float64 and int64 to Int64 - something to remember when applying styles etc. later:
dfs = df.style.format(subset=list(df.select_dtypes(include='Float64')), precision=self.precision, na_rep='')
Upvotes: 1
Reputation: 441
I know this is an old question, but there are missing options in the current answers.
You can specify the type of the column; for this example it would be something like:
df = pd.read_csv('file.csv', dtype={'a': 'Int32', 'b': 'Int32'})
And pandas will set the missing values to <NA>.
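A minimal sketch with the question's data plus a blank row (the test row is added here for illustration):
import pandas as pd
from io import StringIO

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

# The nullable Int32 dtype is applied during the parse itself,
# so blanks become <NA> instead of forcing the column to float:
df = pd.read_csv(StringIO(data), dtype={"a": "Int32", "b": "Int32"})
print(df)
#      name       a       a1       b       b1
# 0  arnold  300311  arnld01  300311  arnld01
# 1     sam  300713    sam01  300713    sam01
# 2    test    <NA>   test01    <NA>   test01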
Upvotes: 14
Reputation: 39
Converting float to integer values using pandas read_csv:
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('WorldWarWeather_Data.csv')
X = dataset.iloc[:, 3:11].values
y = dataset.iloc[:, 2].values

# Cast the NumPy arrays to plain int - this assumes the selected
# columns contain no NaNs, since NaN cannot be represented as int
X = X.astype(int)
y = y.astype(int)
Upvotes: 2
Reputation: 50540
As root mentioned in the comments, this is a limitation of Pandas (and NumPy). NaN is a float, and the empty values you have in your CSV are NaN. This is listed in the gotchas of pandas as well.
You can work around this in a few ways.
For the examples below I used the following to import the data - note that I added a row with an empty value in columns a and b:
import pandas as pd
from io import StringIO
data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""
df = pd.read_csv(StringIO(data), sep=",")
Your first option is to drop rows that contain this NaN value. The downside is that you lose the entire row. After getting your data into a dataframe, run this:
df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This drops all NaN rows from the dataframe, then converts column a and column b to int:
>>> df.dtypes
name object
a int32
a1 object
b int32
b1 object
dtype: object
>>> df
name a a1 b b1
0 arnold 300311 arnld01 300311 arnld01
1 sam 300713 sam01 300713 sam01
Replace NaN with placeholder data
This option will replace all your NaN values with a throwaway value. That value is something you need to determine; for this test, I made it -999999. This will allow us to keep the rest of the data, convert it to an int, and make it obvious which data is invalid. You'll be able to filter these rows out if you are making calculations based on the columns later.
df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This produces a dataframe like so:
>>> df.dtypes
name object
a int32
a1 object
b int32
b1 object
dtype: object
>>> df
name a a1 b b1
0 arnold 300311 arnld01 300311 arnld01
1 sam 300713 sam01 300713 sam01
2 test -999999 test01 -999999 test01
Finally, another choice is to leave the float values (and NaN) as they are and not worry about the non-integer data type.
Upvotes: 21