Reputation: 3010
I have a problem with detecting which rows are hidden when I open workbook in read-only mode.
It works flawlessly when I set read_only
parameter to False
while loading workbook, because then I can iterate over row_dimensions
to check which rows are hidden - but opening workbook in read-write mode takes much longer (~2 mins vs ~20 secs in read-only mode) and consumes over 1GB of RAM.
Unfortunately read-only worksheets don't have row_dimensions
attribute.
Any help is welcome.
Upvotes: 2
Views: 1455
Reputation: 3382
The underlying issue is that the parser is used once and discarded after iterating over all the rows. This is how read_only mode can optimize memory allocation and generate rows upon request. Interestingly enough, the parser itself is still creating the row_dimensions
with the row attributes in it!
There are a couple of work arounds you could attempt. In lieu of forking and creating an official fix that exposes the ReadOnlyWorksheet parser, I went with monkey patching:
from openpyxl.worksheet._read_only import ReadOnlyWorksheet, WorkSheetParser, EMPTY_CELL
# The override:
class MyReadOnlyWorksheet(ReadOnlyWorksheet):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.parser = None
def row_is_hidden(self, row_index):
str_row_index = str(row_index)
if self.parser and str_row_index in self.parser.row_dimensions:
return self.parser.row_dimensions[str_row_index].get('hidden') == '1'
if self.parser is None or row_index > self.parser.row_counter:
raise RuntimeError('Must generate the row before calling')
return False
def _cells_by_row(self, min_col, min_row, max_col, max_row, values_only=False):
"""
The source worksheet file may have columns or rows missing.
Missing cells will be created.
Logically the same but saves the parser to "self" during row iteration
"""
filler = EMPTY_CELL
if values_only:
filler = None
max_col = max_col or self.max_column
max_row = max_row or self.max_row
empty_row = []
if max_col is not None:
empty_row = (filler,) * (max_col + 1 - min_col)
counter = min_row
idx = 1
src = self._get_source()
parser = WorkSheetParser(src, self._shared_strings,
data_only=self.parent.data_only, epoch=self.parent.epoch,
date_formats=self.parent._date_formats)
### Cache parser in order to check generated row attrs ###
self.parser = parser
for idx, row in parser.parse():
if max_row is not None and idx > max_row:
break
# some rows are missing
for _ in range(counter, idx):
counter += 1
yield empty_row
# return cells from a row
if counter <= idx:
row = self._get_row(row, min_col, max_col, values_only)
counter += 1
yield row
if max_row is not None and max_row < idx:
for _ in range(counter, max_row+1):
yield empty_row
src.close()
# the monkey patch:
import openpyxl.reader.excel
openpyxl.reader.excel.ReadOnlyWorksheet = MyReadOnlyWorksheet
# the test drive:
from openpyxl import load_workbook
file_location = '' # load your file
workbook = load_workbook(file_location, data_only=True, keep_links=False, read_only=True)
for worksheet in workbook.worksheets:
row_gen = worksheet.rows
for i, row in enumerate(row_gen, start=1):
if worksheet.row_is_hidden(i):
continue # do not process hidden rows.
This does what you need, but beware! I would add sufficient test coverage before using in production (think things like future version re-keying row_dimension dict, removing row_dimensions from read_only parsing, etc). You can similarly add your own accessors to the worksheet that exposes other row attrs (or return the entire dict).
Happy coding!
Upvotes: 2