Python Excel Pandas DataFrame Import - Handling nested (and merged) headers

Question

I've spent a few days now working on an issue to import a series of Excel spreadsheets into Pandas DataFrames. I'm relatively experienced in doing this, and can handle a variety of different scenarios, but I am still quite new to Python and am really struggling to overcome this issue.

Essentially, I have a couple of Excel files, that I want to import individually, that have nested (2 row, multi-column) headers. I want to merge the two rows into a single header row, concatenating the values contained within row 1 with the values found in row 2 for a given column.

If the columns in the two rows were perfectly aligned, I'm confident I'd know how to merge them into a single row, concatenating where appropriate. However, in these particular spreadsheets some cells in row 1 are blank, and some are merged, so that one merged cell in row 1 may span multiple columns in row 2.

I hope that the following better demonstrates the situation (the top 'column' row is only indicative, to show cell positioning):

[Image: Current header configuration]

A desirable header row needs to look as follows:

[Image: Desired header configuration]

There are many more columns than this, each with differing configurations of (blank) and merged, or perfectly aligned, columns in row 1 vs. row 2.

If it helps to know, row 2 is predictable: only row 1 contains the (blank) or merged cells. Also, in row 1 the first n values are always (blank), i.e. NaN. However, the number of (blank) values at the start differs between the two spreadsheets.
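To make the "first n values are blank" point concrete, here is a minimal sketch of how n could be found per file instead of being hard-coded. The `row1` values are made up for illustration; in practice they would be the first row of a frame read with `pd.read_excel(..., header=None)`.

```python
import numpy as np
import pandas as pd

# Hypothetical row 1 of the header: three leading blanks, then merged labels
# (a merged cell reads as one value followed by NaN).
row1 = pd.Series([np.nan, np.nan, np.nan, "Sales", np.nan, "Costs"])

# Position of the first non-blank cell equals the number of leading blanks.
filled = row1.notna().to_numpy()
n_leading = int(filled.argmax()) if filled.any() else len(filled)

print(n_leading)
```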

I was hoping to be able to dynamically handle the two rows, for whatever the value population in the rows might look like.

I've tried many different approaches (too many to remember / mention - including coming up with a weird YAML solution to manually map the column definitions which became very unwieldy) and it's likely that I need to combine some of what I've already tried in order to formulate a solution. I just can't seem to nail down the actual solution.

Presently, my approach is to fill the first n (blank) columns in row 1 with the values from the corresponding columns in row 2, then forward-fill (`ffill`) the merged headers across the row. Once that is done, I can merge the values from row 2 into row 1. However, I cannot get this to work consistently. `ffill` seems to be quite temperamental with these DataFrames, and it's all a little painful.
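For reference, a minimal sketch of that fill-then-concatenate idea on a made-up frame. The data, the space separator and the label names are all assumptions for illustration, and the sketch assumes every blank to the right of a label in row 1 belongs to a merged cell (a genuinely empty cell there would wrongly inherit the label). Because `ffill` has nothing to copy into the leading blanks, they stay NaN and can simply fall back to row 2, so n does not need to be known.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sheet read with pd.read_excel(..., header=None):
# rows 0 and 1 are the two header rows, row 2 onwards is data.
raw = pd.DataFrame(
    [
        [np.nan, np.nan, "Sales", np.nan, "Costs", np.nan],
        ["ID", "Name", "Q1", "Q2", "Q1", "Q2"],
        [1, "a", 10, 20, 3, 4],
    ]
)

# Spread each merged label to the right; leading blanks remain NaN.
top = raw.iloc[0].ffill()
bottom = raw.iloc[1].astype(str)

# Where row 1 is still blank use row 2 alone, otherwise join the two labels.
header = [b if pd.isna(t) else f"{t} {b}" for t, b in zip(top, bottom)]

# Drop the two header rows and attach the flattened header.
df = raw.iloc[2:].reset_index(drop=True)
df.columns = header

print(list(df.columns))
```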

There is also the (very real) possibility that I'm overcomplicating the whole issue and that there's a much easier solution.

As previously stated, I'm still quite new to Python. I'm also very new to Stack Overflow; this is my first post. I hope that I've outlined my question correctly and provided enough information. I have searched around for answers to this, but I've not been able to locate anything that fits the scenario I'm facing. If anyone could help, or provide some ideas as to how I could approach this better, I would be very grateful.


Answers (0)
