Pandas ExcelFile.parse has NaNs in index when index_col is specified

Question

I have an excel file that I am reading into a pandas DataFrame that has the header on row 1 (python index) and a blank row between the header and the data. When I specify index_col it treats the blank row as part of the index as a NaN. What is the best way to avoid this behavior?

Test file:

idx value

a   1

Without specifying index_col:

print xs.parse(header = 1)
   idx  value
0  NaN    NaN
1    a      1

print xs.parse(header = 1).index
Int64Index([0, 1], dtype='int64')

Specifying index col:

print xs.parse(header = 1, index_col = 0)
     value
idx       
NaN    NaN
a        1

print xs.parse(header = 1, index_col = 0).index
Index([nan, u'a'], dtype='object')

EdChum · Accepted Answer

You can pass skiprows=[1] to skip the blank line, I tested this on a dummy xl sheet, see ExcelFile.parse:

In [44]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1])

Out[44]:
   idx  value
0   12    NaN
1    2    NaN
2    1    NaN

compare with:

In [45]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse()

Out[45]:
   idx  value
0  NaN    NaN
1   12    NaN
2    2    NaN
3    1    NaN

In [47]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0)
Out[47]:
   idx  value
0   12    NaN
1    2    NaN
2    1    NaN
In [49]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0, index_col=0)
Out[49]:
     value
idx       
12     NaN
2      NaN
1      NaN
In [50]:

xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(header=0, index_col=0)
Out[50]:
     value
idx       
NaN    NaN
 12    NaN
 2     NaN
 1     NaN

Pandas ExcelFile.parse has NaNs in index when index_col is specified

Answers (1)

Related Questions