Reputation: 388
I'm wondering is there a way to make openpyxl to ignore cells without a values? Cells are in xml tree, but have no values, only styles defined
The problem is, I have a sheet, which xml structure is like this:
<row r="18" spans="3:9 16384:16384" ht="15" customHeight="1" thickBot="1">
<c r="C18" s="14" t="s">
<v>24</v>
</c>
<c r="D18" s="14" t="s">
<v>25</v>
</c>
<c r="E18" s="14" t="s">
<v>27</v>
</c>
...
<c r="I18" s="12"/>
<c r="XFD18" s="13"/>
</row>
<row r="1048576" spans="4:4" ht="15" customHeight="1">
<c r="D1048576" s="13"/>
</row>
When openpyxl
gets to the row 18 and tries to get cells, the way it generates cells make it to fill the gap between cells I18 and XFD18 with 16336 Empty cells. And same goes for rows 18 - 1048576, I'm getting a million of empty rows because of the way rows in XML tree are being parsed in ws.iter_rows
method. I see logic beihind it and I understand that it's a problem of xlsx file itself, but I want to know whether it's something I can deal with by using openpyxl API?
I know that xlrd
ignores empty cells formatting (it's just not implemented) and thus xlrd
won't add such cells to the row. Is there a way to make openpyxl act the same? Looking at source code, I see no way of doing that using existing API, except from defining max_column
and max_row
for ws.iter_rows
method (which is not optimal, since actual number of columns/rows may vary)
I managed to accomplish the task modifying the xml_source
like that (wb is opened as read_only):
import io
import xml.etree.ElementTree as ET
from openpyxl.xml.functions import iterparse, safe_iterator
from openpyxl.xml.constants import SHEET_MAIN_NS
ROW_TAG = '{%s}row' % SHEET_MAIN_NS
CELL_TAG = '{%s}c' % SHEET_MAIN_NS
VALUE_TAG = '{%s}v' % SHEET_MAIN_NS
SHEET_DATA = '{%s}sheetData' % SHEET_MAIN_NS
def clean_up_xml_from_empty_rows_and_cells(sheet):
sheet_data = None
rows_to_delete = set()
p = iterparse(sheet.xml_source, tag=[ROW_TAG], remove_blank_text=True)
for _event, element in p:
if element.tag == SHEET_DATA:
sheet_data = element
continue
if element.tag == ROW_TAG:
cells_to_delete = set()
for cell in safe_iterator(element, CELL_TAG):
if cell.findtext(VALUE_TAG) is None:
cells_to_delete.add(cell)
for cell in cells_to_delete:
element.remove(cell)
if not len(element):
rows_to_delete.add(element)
for row in rows_to_delete:
sheet_data.remove(row)
out_file = io.BytesIO()
element_tree = ET.ElementTree(p.root)
element_tree.write(out_file, encoding='utf8')
out_file.seek(0)
sheet.xml_source = out_file
But it feels so wrong! And so dangerous, I'm absolutely unsure the function like that won't break the parsing of other xlsx files.
Is there really no way to ignore cells without a values? Browsing SO and google - it feels like noone ever encountered such problems :(
Upvotes: 1
Views: 466
Reputation: 19537
You're probably best off patching parse_row()
which is where this stuff is controlled. This is easier to work with and thest than messing with the parser.
Upvotes: 1