Reputation: 1189
I have an excel file that I am reading into a pandas DataFrame that has the header on row 1 (python index) and a blank row between the header and the data. When I specify index_col it treats the blank row as part of the index as a NaN. What is the best way to avoid this behavior?
Test file:
idx value
a 1
Without specifying index_col:
print xs.parse(header = 1)
idx value
0 NaN NaN
1 a 1
print xs.parse(header = 1).index
Int64Index([0, 1], dtype='int64')
Specifying index col:
print xs.parse(header = 1, index_col = 0)
value
idx
NaN NaN
a 1
print xs.parse(header = 1, index_col = 0).index
Index([nan, u'a'], dtype='object')
Upvotes: 0
Views: 1068
Reputation: 394209
You can pass skiprows=[1]
to skip the blank line, I tested this on a dummy xl sheet, see ExcelFile.parse
:
In [44]:
xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1])
Out[44]:
idx value
0 12 NaN
1 2 NaN
2 1 NaN
compare with:
In [45]:
xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse()
Out[45]:
idx value
0 NaN NaN
1 12 NaN
2 2 NaN
3 1 NaN
In [47]:
xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0)
Out[47]:
idx value
0 12 NaN
1 2 NaN
2 1 NaN
In [49]:
xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(skiprows=[1], header=0, index_col=0)
Out[49]:
value
idx
12 NaN
2 NaN
1 NaN
In [50]:
xs = pd.ExcelFile(r'c:\data\book1.xls')
xs.parse(header=0, index_col=0)
Out[50]:
value
idx
NaN NaN
12 NaN
2 NaN
1 NaN
Upvotes: 1