Phil M
Phil M

Reputation: 93

Parse Excel Report

I have an Excel workbook report(abc.xlsx) that I am trying to parse to get data from a couple of columns and I am wondering how this can be done in Python/Pandas. The report is in a report format so it is not nicely formatted like a row-column configuration and it looks like this:

                                 My Report
                                 ABC LLC
                        from 06/01/2015 to 06/30/2015 

Instrument                Identifier    Quantity    Start    End     Total 

North America
                            XYZ           100         0      0        (1,893.52)
North America Subtotal                                                (1,893.52)

Europe
                            ABC            50         10     20        (4,653.21)
Europe Subtotal                                                       (4,653.21     

The data that I am interested in getting is the identifier (in my Excel it is in columns B and C and they are merged) and Total (column G). The challenge here is this format is somewhat irregular and with a subtotal line. One way I can think of it read this via Pandas using read_excel and then iterate through the rows and just consider rows where column B&C and G are populated, but then not sure how to handle the situation of the headers plus the merged cells. This approach is also pretty ugly as well, and so I am wondering if anyone has experience parsing similar reports in Excel as well.

Upvotes: 1

Views: 798

Answers (2)

dagrha
dagrha

Reputation: 2559

This is how I might handle this. It will not work in every data munging situation, of course, but it seems like it works on the example you provided.

Going to use pd.read_excel, then skip the first several rows on import to just get them out of the way.

In [1]: import pandas as pd

In [2]: df = pd.read_excel("abc.xls", skiprows=4)

In [3]: df
Out[3]:
      Instrument Identifier  Quantity  Start       End    Total
0            NaN        NaN       NaN    NaN       NaN      NaN
1  North America        NaN       NaN    NaN       NaN      NaN
2            NaN        XYZ       100      0         0  1893.52
3  North America        NaN       NaN    NaN  Subtotal  1893.52
4            NaN        NaN       NaN    NaN       NaN      NaN
5         Europe        NaN       NaN    NaN       NaN      NaN
6            NaN        ABC        50     10        20  4653.21
7         Europe        NaN       NaN    NaN  Subtotal  4653.21

Here I'll use the ffill argument of the fillna method to forward fill in NaN values in the Instruments column.

In [4]: df.Instrument.fillna(method="ffill", inplace=True)

In [5]: df
Out[5]:
      Instrument Identifier  Quantity  Start       End    Total
0            NaN        NaN       NaN    NaN       NaN      NaN
1  North America        NaN       NaN    NaN       NaN      NaN
2  North America        XYZ       100      0         0  1893.52
3  North America        NaN       NaN    NaN  Subtotal  1893.52
4  North America        NaN       NaN    NaN       NaN      NaN
5         Europe        NaN       NaN    NaN       NaN      NaN
6         Europe        ABC        50     10        20  4653.21
7         Europe        NaN       NaN    NaN  Subtotal  4653.21

Now just filter out the null values from the Identifier column.

In [6]: df[df.Identifier.notnull()]
Out[6]:
      Instrument Identifier  Quantity  Start End    Total
2  North America        XYZ       100      0   0  1893.52
6         Europe        ABC        50     10  20  4653.21

Upvotes: 2

Joseph Farah
Joseph Farah

Reputation: 2534

There are a wide, wide range of modules available for parsing excel files. An example of this is (my personal favorite) the openpyxl module. Some sample syntax:

wb = openpyxl.load_workbook('example.xlsx') # opens an excel workbook

wb.get_sheet_names() # get sheet namesLOL

'Cell ' + c.coordinate + ' is ' + c.value # getting value of one cell

openpyxl has a huge range of functions for parsing data from excel files. You should be able to find an easy way of accomplishing your goal here.

Here are some good resource sites:

https://automatetheboringstuff.com/chapter12/

https://openpyxl.readthedocs.org/en/latest/

Happy coding! and best of luck!

Upvotes: 0

Related Questions