Is there a way of differentiating collapsed column in excel sheet via python

Question

I am reading excel sheet via python and trying to read only visible rows in python (not hidden or collapsed). I went through documentation of OPENPYXL and found that it has "hidden" and "collapsed" property. But once I read the excel sheet not always "hidden" or "collapsed" is true when column is hidden. My code is following

def read_visible_data_from_sheet(sheet):
    data = []
    # Iterate through rows
    for row in sheet.iter_rows():
        row_num = row[0].row
        # Check if the row is hidden or has height set to 0
        row_hidden = sheet.row_dimensions[row_num].hidden
        row_height = sheet.row_dimensions[row_num].height
        row_level = sheet.row_dimensions[row_num].outlineLevel

        if row_hidden or (row_height is not None and row_height == 0):
            continue  # Skip hidden rows

        # Check if any parent row is collapsed
        is_collapsed = False
        for parent_row_num in range(1, row_num):
            if sheet.row_dimensions[parent_row_num].outlineLevel < row_level and sheet.row_dimensions[parent_row_num].hidden == True:
                is_collapsed = True
                break
        if is_collapsed:
            continue # Skip collapsed rows

        visible_row = []
        # Iterate through columns in the row
        for cell in row:
            col_letter = cell.column_letter
            col_dim = sheet.column_dimensions.get(col_letter)
            if col_dim:
                col_hidden = col_dim.hidden
                col_width = col_dim.width
            else:
                continue
            # Check if the column is hidden or has width set to 0
            if col_hidden or (col_width is not None and col_width == 0):
                continue  # Skip hidden columns

            visible_row.append(cell.value)
        # Append the visible row to the data list
        if visible_row:  # Avoid adding empty rows
            data.append(visible_row)

    # Convert to a DataFrame
    df = pd.DataFrame(data)
    return df, sheet

In some cases the sheet.column_dimensions contains columns which are visible while in other cases it doesn't contain columns in sheet.column_dimensions even if it is visible.

Is there a better way to deal with such cases? I am open on exploring any other library if necessary.

moken · Accepted Answer

There may be some inconsistencies in how Openpyxl handles the hidden row and column information however there is a couple of things to keep in mind when searching for this detail.

Row height or Column width may be (Python) None unless you have actually set the value yourself, even though the row or column does have a height or width. So these may not be good for determining if it's hidden.
If a row or column has its height or width set to zero, that object will become hidden and the height or width value will be the previous or default value, i.e. it will not be 0.

As mentioned in the example since the Columns status is not going to change it seems wasteful to keep checking the status for every row your are copying.
Check the columns once make a list and then use that when cycling through the rows.

The following Code example creates a Visible Columns list and then uses it to copy all rows for the dataframe.

The code loops the Rows and on the first iteration creates the Visible Columns List visible_columns from the first row's column status. Once the List is created it is used for the first and all subsequent rows.
The criteria for a visible columns is based on;

Looks for columns that are in the sheet column_group list and whether they are hidden or not since only the first column in a collapsed group is marked as hidden.
Looks for other columns that are otherwise marked as 'hidden' Columns that are in a hidden group that are not the first in the range are not marked as hidden.

The code then processes each row starting from the first row (used to create the list). After checking if the row is visible, copies the values from the row per the visible_columns List.

A row's visibility status is slightly different to columns. The row is determined to be visible purely by that row's hidden status in row_dimensions

Example Sheet
The following example Sheet was used;

The Sheet has the following Grouped and Hidden Rows and Columns;

"""
Hidden Rows & Columns

Columns;
C, width was set to 0
E & G, Hidden
J - L, Grouped, Collapsed
N & O, Grouped, Not collapsed


Rows;
7, Height was set to 0
15 - 19, Grouped, Collapsed
27, Empty Row 
31 & 32, Hidden
35 - 37, Grouped, Not collapsed 
"""

Code Sample

import pandas as pd
import openpyxl
from openpyxl.utils.cell import range_boundaries as rb



def read_visible_data_from_sheet(sheet):
    data = []
    visible_columns = []

    # Get the number of hidden (collapsed) columns
    hidden_grouped_cols = []  # The List of Grouped Columns that are hidden
    for range_string in sheet.column_groups:
        group_collapsed = sheet.column_dimensions[range_string[:1]].hidden
        rng_boundaries = rb(range_string)
        hidden_grouped_cols += [x for x in range(rng_boundaries[0], rng_boundaries[2]+1) if group_collapsed]

    get_visible_columns = True  # First Row iteration build a List of visible Columns
    # Iterate rows (all rows or set min and max row to iterate over)
    for row in sheet.iter_rows():
        visible_row = []
        # Build list of visible columns from first row checked
        if get_visible_columns:  # get_visible_columns is True for first row
            for cell in row:  # Looping the columns
                col_info = sheet.column_dimensions[cell.column_letter]
                # Create List from non hidden Columns including those in uncollapsed groups
                if not col_info.hidden and cell.col_idx not in hidden_grouped_cols:
                    visible_columns.append(cell.column_letter)

            get_visible_columns = False  # First row builds the Visible Columns List. Subsequent rows can skip this step

        # Add values from Row if it's not hidden
        if not sheet.row_dimensions[row[0].row].hidden:
            # Get List of values from all visible columns
            visible_row = [cell.value for cell in row if cell.column_letter in visible_columns]
            # Empty Row check
            if all(e is None for e in visible_row):  # Drop row if all values obtained are None
                visible_row = None

        # If visible_row contains data add to the data List
        if visible_row:  # Avoid adding no value row
            data.append(visible_row)

    return data


# Open Excel file and get Sheet
excelfile = 'foo.xlsx'
wb = openpyxl.load_workbook(excelfile)
ws = wb['Sheet1']

# Create dataframe from visible rows and columns
df = pd.DataFrame(read_visible_data_from_sheet(ws))

print(df)

Output

Visible Columns List;  ['A', 'B', 'D', 'F', 'H', 'I', 'M', 'N', 'O', 'P']

Dataframe;
       0                    1         2               3         4        5     6     7     8     9
0   ROW#                    B         D               F         H        I    M      N     O     P
1      2                 Date  DMA Code       Car Model  Visitors  Revenue  col4  col5  col6  col7
2      3  2017-02-22 00:00:00       500           Focus         3       47     3     4     5     6
3      4  2017-02-11 00:00:00       500         Avenger         3       27     3     4     5     6
4      5  2017-03-22 00:00:00       500        Explorer         1       16     3     4     5     6
5      6  2017-07-03 00:00:00       500        3 Series         1       12     3     4     5     6
6      8  2017-07-15 00:00:00       500           Civic        16      100     3     4     5     6
7      9  2017-07-22 00:00:00       500          Accent         1       16     3     4     5     6
8     10  2017-07-24 00:00:00       500              ML       214       39     3     4     5     6
9     11  2017-08-28 00:00:00       500              RX        11       30     3     4     5     6
10    12  2017-06-03 00:00:00       500  Silverado 1500         3       30     3     4     5     6
11    13  2017-08-18 00:00:00       500            Neon         1       10     3     4     5     6
12    14  2017-03-04 00:00:00       501          Sonata         1        9     3     4     5     6
13    20  2017-02-02 00:00:00       501         Integra        16      112     3     4     5     6
14    21  2017-02-27 00:00:00       501              S4         3       31     3     4     5     6
15    22  2017-02-12 00:00:00       501        3 Series        10      107     3     4     5     6
16    23  2017-02-16 00:00:00       501              X3         8       72     3     4     5     6
17    24  2017-02-06 00:00:00       501          Camaro        12      246     3     4     5     6
18    25  2017-02-04 00:00:00       501          Impala         5       47     3     4     5     6
19    26  2017-02-07 00:00:00       501          Accord        10      909     3     4     5     6
20    28  2017-02-26 00:00:00       501           Aztek        10        7     3     4     5     6
21    29  2017-02-22 00:00:00       501             Ion         3       47     3     4     5     6
22    30  2017-02-22 00:00:00       501           Aerio         3       47     3     4     5     6
23    33  2017-03-06 00:00:00       501          Altima         2       16     3     4     5     6
24    34  2017-06-18 00:00:00       501             RDX         5       47     3     4     5     6
25    35  2017-06-19 00:00:00       501              M3         3       93     3     4     5     6
26    36  2017-06-17 00:00:00       501         Mustang         6      124     3     4     5     6
27    37  2017-06-09 00:00:00       501           Pilot         4       47     3     4     5     6
28    38  2017-06-17 00:00:00       501         Mark LT         3       47     3     4     5     6

Is there a way of differentiating collapsed column in excel sheet via python

Answers (1)

Related Questions