Pandas read excel sheet with multiple header when first column is empty

Question

I have an excel sheet like this:

I want to read it with pandas read_excel and I tried this:

df = pd.read_excel("test.xlsx", header=[0,1])

but it throws me this error:

ParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Any suggestions?

Orenshi · Accepted Answer

If you don't mind massaging the DataFrame after reading the Excel you can try the below two ways:

>>> pd.read_excel("/tmp/sample.xlsx", usecols = "B:F", skiprows=[0])
  header1 Unnamed: 1 Unnamed: 2 header2 Unnamed: 4
0    col1       col2       col3    col4       col5
1       a          0          x       3          d
2       b          1          y       4          e
3       c          2          z       5          f

In above, you'd have to fix the first level of the MultiIndex since header1 and header2 are merged cells

>>> pd.read_excel("/tmp/sample.xlsx", header=[0,1], usecols = "B:F", 
skiprows=[0])
        header1      header2
header1    col1 col2    col3 col4
a             0    x       3    d
b             1    y       4    e
c             2    z       5    f

In above, it got pretty close by skipping the empty row and parsing only columns (B:F) with data. If you notice, the columns got shifted though...

Note Not a clean solution but just wanted to share samples with you in a post rather than a comment

-- Edit based on discussion with OP --

Based on documentation for pandas read_excel, header[1,2] is creating a MultiIndex for your columns. Looks like it determines the labels for the DataFrame depending on what is populated in Column A. Since there's nothing there... the index has a bunch of Nan like so

>>> pd.read_excel("/tmp/sample.xlsx", header=[1,2])
    header1           header2
       col1 col2 col3    col4 col5
NaN       a    0    x       3    d
NaN       b    1    y       4    e
NaN       c    2    z       5    f

Again if you're okay with cleaning up columns and if the first column of the xlsx is always blank... you can drop it like below. Hopefully this is what you're looking for.

>>> pd.read_excel("/tmp/sample.xlsx", header[1,2]).reset_index().drop(['index'], level=0, axis=1)
  header1           header2
     col1 col2 col3    col4 col5
0       a    0    x       3    d
1       b    1    y       4    e
2       c    2    z       5    f

Pandas read excel sheet with multiple header when first column is empty

Answers (2)

Related Questions