László

Reputation: 4144

Why does pandas.to_csv write floats for integers?

I have the code below to parse some CSV data. The key part is the last few lines; the rest is only there for context. In the end there are three columns in my data. The ID variable LopNr and year should hold integers anyway, but I convert the entire DataFrame to integer just in case. Why do I get ".0" for the LopNr and year columns in the resulting CSV file, while the third column, with the aggregated data, is converted to integers and written without ".0"? I would have thought that after .astype(int) all columns would hold integers and be exported to CSV without being converted back to floats.

import iopro
from pandas import *

neuro   = DataFrame()
for year in xrange(2005,2012):
    for month in xrange(1,13):
        if year == 2005 and month < 7:
            continue
        filename = 'Q:\\drugs\\lmed_' + str(year) + '_mon'+ str(month) +'.txt'
        adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
        monthly = adapter[['LopNr','ATC','TKOST']][:]
        monthly['year']=year
        neuro = neuro.append(monthly[(monthly.ATC.str.startswith('N')) & (~(monthly.TKOST.isnull()))])

neuro = neuro.groupby(['LopNr','year']).sum()
neuro = neuro.astype(int)
neuro.to_csv('Q:\\drugs\\annual_neuro_costs.csv')

Upvotes: 6

Views: 6192

Answers (1)

ostrokach

Reputation: 19932

This is probably because your 'LopNr' and 'year' columns contain null values. At present, pandas does not support integer columns with null values and instead upcasts the entire column to float.

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions
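The following is a minimal sketch of how this could play out in the question's code; it is an assumption about the cause, not a confirmed diagnosis. The data is made up, and the missing ID is a hypothetical stand-in for whatever produces the float dtype in the real files. It shows that a missing value makes the key column float, that groupby moves the keys into the index, and that DataFrame.astype(int) converts only the columns, so the index keeps its float dtype when written by to_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data; all values are made up.
raw = pd.DataFrame({
    "LopNr": [1, 1, 2, np.nan],        # one missing ID forces float64
    "year": [2005, 2005, 2006, 2006],
    "TKOST": [10.5, 4.5, 7.0, 3.0],
})

# The grouping keys become the index of the result.
totals = raw.groupby(["LopNr", "year"]).sum()

# astype converts the remaining column (TKOST) but leaves the index alone.
totals = totals.astype(int)

index_dtypes = [str(level.dtype) for level in totals.index.levels]
csv_before = totals.to_csv()

# One possible fix: move the keys back into columns and cast them explicitly.
fixed = totals.reset_index().astype({"LopNr": int, "year": int})
csv_after = fixed.to_csv(index=False)

print(index_dtypes)
print(csv_before)
print(csv_after)
```

Casting after reset_index works here because groupby drops rows whose key is missing by default, so the remaining keys are all whole numbers.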


Edit:

As of version 0.24.0, pandas has preliminary support for a nullable integer data type.

By default, integers still get converted to floats if there are missing values:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, None], [5, None, 7]])
>>> print(df)
   0    1    2
0  1  2.0  NaN
1  5  NaN  7.0

However, if we specify dtype="Int64", this no longer happens:

>>> df = pd.DataFrame([[1, 2, None], [5, None, 7]], dtype="Int64")
>>> print(df)
   0     1     2
0  1     2  <NA>
1  5  <NA>     7
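To tie this back to the CSV output in the question, here is a small sketch, with made-up values, of converting columns that have already been upcast to float into the nullable "Int64" dtype before writing. The column names are borrowed from the question; everything else is illustrative.

```python
import pandas as pd

# Made-up values; None marks a missing entry, so both columns start as float64.
df = pd.DataFrame({"LopNr": [1, None, 3], "year": [2005, 2006, None]})
float_dtypes = [str(dtype) for dtype in df.dtypes]

# Convert to the nullable integer dtype (note the capital "I").
df = df.astype("Int64")

# Missing values are written as empty fields by default.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The whole numbers are written without a trailing ".0", and the missing entries show up as empty fields rather than forcing the column to float.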

Upvotes: 5
