Reputation: 953

pandas.read_csv() strange behavior for empty (default) values

I have following input trans.csv file:

Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax
2012-03-14,USD,AAPL,BUY,1000
2012-05-12,USD,SBUX,SELL,500

The fields UnitPrice, Cost and Tax are optional. If they are not specified I expect NaN in the DataFrame cell.

I read the csv file with:

t = pandas.read_csv('trans.csv', parse_dates=True, index_col=0)

and got the following result:

           Currenncy Symbol  Type  Units   UnitPrice       Cost       Tax
Date                                                                     
2012-03-14       USD   AAPL   BUY   1000  2012-05-12  012-05-12  12-05-12
2012-02-05       USD   SBUX  SELL    500         NaN        NaN       NaN

Why are there no NaN in the first row and is the Date repeated? Any workaround to get NaN for the unspecified fields?

Upvotes: 2

Answers (2)

Andy Hayden

Reputation: 375445

Here's a method which can handle some more cases (when there is some data in UnitCost, Cost, etc.).

In [1]: df = pd.read_csv('trans.csv', header=None)

In [2]: df.columns = df.ix[0]

In [3]: df[1:].set_index('Date')
Out[3]: 
           Currenncy Symbol  Type Units UnitPrice Cost  Tax
Date                                                       
2012-03-14       USD   AAPL   BUY  1000       NaN  NaN  NaN
2012-05-12       USD   SBUX  SELL   500       NaN  NaN  NaN
2012-05-12       USD   SBUX  SELL   500       NaN  NaN  NaN

It's worth noting that the dtype of the these columns will be object.

However, I think this should be caught by to_csv so I posted as an issue on github.

Upvotes: 2

Fredrick Brennan

Reputation: 7357

Your CSV file is malformed. I get the same answer as you in Pandas 0.10, and while I admit that it is indeed very, very strange, you shouldn't be feeding it malformed data.

Date,Currenncy,Symbol,Type,Units,UnitPrice,Cost,Tax
2012-03-14,USD,AAPL,BUY,1000,,,
2012-05-12,USD,SBUX,SELL,500,,,

returns the expected

>>> import pandas as pd
>>> t = pd.read_csv('pandas_test', parse_dates=True, index_col=0)
>>> t
           Currenncy Symbol  Type  Units  UnitPrice  Cost  Tax
Date                                                          
2012-03-14       USD   AAPL   BUY   1000        NaN   NaN  NaN
2012-05-12       USD   SBUX  SELL    500        NaN   NaN  NaN

Upvotes: 3

pandas.read_csv() strange behavior for empty (default) values

Answers (2)

Related Questions