Reputation: 93
I have an Excel workbook report(abc.xlsx) that I am trying to parse to get data from a couple of columns and I am wondering how this can be done in Python/Pandas. The report is in a report format so it is not nicely formatted like a row-column configuration and it looks like this:
My Report ABC LLC from 06/01/2015 to 06/30/2015 Instrument Identifier Quantity Start End Total North America XYZ 100 0 0 (1,893.52) North America Subtotal (1,893.52) Europe ABC 50 10 20 (4,653.21) Europe Subtotal (4,653.21
The data that I am interested in getting is the identifier (in my Excel it is in columns B and C and they are merged) and Total (column G). The challenge here is this format is somewhat irregular and with a subtotal line. One way I can think of it read this via Pandas using read_excel and then iterate through the rows and just consider rows where column B&C and G are populated, but then not sure how to handle the situation of the headers plus the merged cells. This approach is also pretty ugly as well, and so I am wondering if anyone has experience parsing similar reports in Excel as well.
Upvotes: 1
Views: 798
Reputation: 2559
This is how I might handle this. It will not work in every data munging situation, of course, but it seems like it works on the example you provided.
Going to use pd.read_excel
, then skip the first several rows on import to just get them out of the way.
In [1]: import pandas as pd
In [2]: df = pd.read_excel("abc.xls", skiprows=4)
In [3]: df
Out[3]:
Instrument Identifier Quantity Start End Total
0 NaN NaN NaN NaN NaN NaN
1 North America NaN NaN NaN NaN NaN
2 NaN XYZ 100 0 0 1893.52
3 North America NaN NaN NaN Subtotal 1893.52
4 NaN NaN NaN NaN NaN NaN
5 Europe NaN NaN NaN NaN NaN
6 NaN ABC 50 10 20 4653.21
7 Europe NaN NaN NaN Subtotal 4653.21
Here I'll use the ffill
argument of the fillna
method to forward fill in NaN values in the Instruments column.
In [4]: df.Instrument.fillna(method="ffill", inplace=True)
In [5]: df
Out[5]:
Instrument Identifier Quantity Start End Total
0 NaN NaN NaN NaN NaN NaN
1 North America NaN NaN NaN NaN NaN
2 North America XYZ 100 0 0 1893.52
3 North America NaN NaN NaN Subtotal 1893.52
4 North America NaN NaN NaN NaN NaN
5 Europe NaN NaN NaN NaN NaN
6 Europe ABC 50 10 20 4653.21
7 Europe NaN NaN NaN Subtotal 4653.21
Now just filter out the null values from the Identifier
column.
In [6]: df[df.Identifier.notnull()]
Out[6]:
Instrument Identifier Quantity Start End Total
2 North America XYZ 100 0 0 1893.52
6 Europe ABC 50 10 20 4653.21
Upvotes: 2
Reputation: 2534
There are a wide, wide range of modules available for parsing excel files.
An example of this is (my personal favorite) the openpyxl
module.
Some sample syntax:
wb = openpyxl.load_workbook('example.xlsx') # opens an excel workbook
wb.get_sheet_names() # get sheet names
LOL
'Cell ' + c.coordinate + ' is ' + c.value # getting value of one
cell
openpyxl
has a huge range of functions for parsing data from excel files. You should be able to find an easy way of accomplishing your goal here.
Here are some good resource sites:
https://automatetheboringstuff.com/chapter12/
https://openpyxl.readthedocs.org/en/latest/
Happy coding! and best of luck!
Upvotes: 0