Pandas advanced read_excel or ExcelFile.parse

Question

I'm trying to do some conditional parsing of excel files into Pandas dataframes. I have a group of excel files and each has some number of lines at the top of the file that are not part of the data -- some identification data based on what report parameters were used to create the report.

I want to use the ExcelFile.parse() method with skiprows=some_number but I don't know what some_number will be for each file.

I do know that the HeaderRow will start with one member of a list of possibilities. How can I tell Pandas to create the dataframe starting on the row that includes any some_string in my list of possibilities?

Or, is there a way to import the entire sheet and then remove the rows preceding the row that includes any some_string in my list of possibilities?

Andy Hayden · Accepted Answer

Most of the time I would just post-process this in pandas, i.e. find the header row, remove the rows above it, and correct the dtypes. This has the benefit of being easier, though it is arguably less elegant (I suspect it'll also be faster doing it this way!):

In [11]: df = pd.DataFrame([['blah', 1, 2], ['some_string', 3, 4], ['foo', 5, 6]])

In [12]: df
Out[12]:
             0  1  2
0         blah  1  2
1  some_string  3  4
2          foo  5  6

In [13]: df[0].isin(['some_string']).argmax()  # assuming it's found (if nothing matches, this returns 0)
Out[13]: 1

I may actually write this in python, as there's probably little/no benefit in vectorizing (and I find this more readable):

def to_skip(df, preceding):
    # Return the position of the first row whose first-column value is in `preceding`.
    for i, s in enumerate(df[0]):
        if s in preceding:
            return i
    raise ValueError("No preceding string found in first column")

In [21]: preceding = ['some_string']

In [22]: to_skip(df, preceding)
Out[22]: 1

In [23]: df.iloc[1:]  # or whatever you need to do
Out[23]:
             0  1  2
1  some_string  3  4
2          foo  5  6
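Putting the steps above together, here is a minimal sketch of the post-processing route. The helper name promote_header and the sample data are made up for illustration, and it assumes the sheet was read without a header (e.g. pd.read_excel(path, header=None)) and that the header string sits in the first column:

```python
import pandas as pd

def promote_header(raw, candidates):
    """Drop the rows above the header row and use that row as column names.

    `raw` is assumed to come from something like
    pd.read_excel(path, header=None), so its columns are labelled 0, 1, 2, ...
    """
    mask = raw.iloc[:, 0].isin(candidates)
    if not mask.any():
        raise ValueError("No header string found in first column")
    pos = int(mask.to_numpy().argmax())  # position of the first match
    data = raw.iloc[pos + 1:].reset_index(drop=True)
    data.columns = raw.iloc[pos].tolist()
    # The junk rows force object dtype on every column; let pandas re-infer.
    return data.infer_objects()

# Stand-in for a sheet read with header=None (hypothetical contents).
raw = pd.DataFrame([
    ["Report: weekly", None, None],
    ["some_string", "a", "b"],
    ["foo", 5, 6],
    ["bar", 7, 8],
])
df = promote_header(raw, ["some_string", "other_header"])
```

infer_objects only does a soft conversion, so columns that hold numbers stored as strings would still need pd.to_numeric or astype.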

The other possibility is messing about with ExcelFile to find the row number first (again with a for-loop as above, but in openpyxl or similar). However, I don't think there is a way to read the Excel file (xml) just once if you do this.

This is somewhat unfortunate when compared to how you could do this on a csv, where you can read the first few lines (until you see the row/entry you want), and then pass this opened file to read_csv. (If you can export your Excel spreadsheet to csv then parse in pandas, that would be faster/cleaner...)
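For comparison, a rough sketch of the csv version of this idea. The function name read_csv_from_header and the sample text are hypothetical, and it assumes the header string is the first comma-separated field on its line. An io.StringIO stands in for an opened file here:

```python
import io
import pandas as pd

def read_csv_from_header(f, candidates):
    """Advance `f` to the header line, then let read_csv parse the rest."""
    while True:
        pos = f.tell()          # remember where this line starts
        line = f.readline()
        if not line:
            raise ValueError("No header line found")
        if line.split(",")[0].strip() in candidates:
            f.seek(pos)         # rewind so read_csv sees the header line
            return pd.read_csv(f)

# Stand-in for open("report.csv"); the contents are made up.
text = "Report: weekly\nRun by: someone\nsome_string,a,b\nfoo,5,6\nbar,7,8\n"
df = read_csv_from_header(io.StringIO(text), ["some_string"])
```

The file is only read once this way, and read_csv infers the dtypes directly because it never sees the junk lines.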

Note: read_excel isn't really that fast anyways (esp. compared to read_csv)... so IMO you want to get to pandas asap.
