Reputation: 413
I have an excel sheet like this:
I want to read it with pandas read_excel and I tried this:
df = pd.read_excel("test.xlsx", header=[0,1])
but it throws me this error:
ParserError: Passed header=[0,1] are too many rows for this multi_index of columns
Any suggestions?
Upvotes: 5
Views: 17718
Reputation: 1873
If you don't mind massaging the DataFrame after reading the Excel you can try the below two ways:
>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0 col1 col2 col3 col4 col5
1 a 0 x 3 d
2 b 1 y 4 e
3 c 2 z 5 f
In above, you'd have to fix the first level of the MultiIndex since header1 and header2 are merged cells
>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols = "B:F",
skiprows=[0])
header1 header2
header1 col1 col2 col3 col4
a 0 x 3 d
b 1 y 4 e
c 2 z 5 f
In above, it got pretty close by skipping the empty row and parsing only columns (B:F) with data. If you notice, the columns got shifted though...
Note Not a clean solution but just wanted to share samples with you in a post rather than a comment
-- Edit based on discussion with OP --
Based on documentation for pandas read_excel, header[1,2]
is creating a MultiIndex for your columns. Looks like it determines the labels for the DataFrame
depending on what is populated in Column A. Since there's nothing there... the index has a bunch of Nan
like so
>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
header1 header2
col1 col2 col3 col4 col5
NaN a 0 x 3 d
NaN b 1 y 4 e
NaN c 2 z 5 f
Again if you're okay with cleaning up columns and if the first column of the xlsx is always blank... you can drop it like below. Hopefully this is what you're looking for.
>>> pd.read_excel("/tmp/sample.xlsx", header[1,2]).reset_index().drop(['index'], level=0, axis=1)
header1 header2
col1 col2 col3 col4 col5
0 a 0 x 3 d
1 b 1 y 4 e
2 c 2 z 5 f
Upvotes: 6
Reputation: 13750
Here is the documentation on the header
parameter:
Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.
I think the following should work:
df = pd.read_excel("test.xlsx", skiprows=2, usecols='B:F', header=0)
Upvotes: 1