Why is python-docx returning cells with text when they should be empty?

Question

I have a docx document converted from PDF with the pdf2docx library. The result looks good, but when I load the docx document with python-docx, I get a table whose cells contain text where they should be empty. These cells are filled with the text of the cell one row above them.

The table looks like this:

The table contains three rows. The first row should contain cells with the values [Barriere, Bonuslevel, Cap, Beobachtungszeitraum, Anfangl], and the second and third rows should be empty except for the last column. But in the debugger I can see that the empty cells contain text values like this:

The text Basiswert is in the first cell and in the sixth cell. The sixth cell should be empty. I opened the XML of the docx document and everything looks fine there, so I think the problem is in the python-docx library. Has anyone had the same problem?

Edit: This article is very helpful:

https://python-docx.readthedocs.io/en/latest/dev/analysis/features/table/cell-merge.html

Basically, the copied cells are continuation cells, which indicate that cells are merged into horizontal or vertical spans. But I still don't know how to read this information from the python-docx API.

scanny · Accepted Answer

The addressing of table cells in python-docx is based on the grid layout. Basically, the grid is all the cells as they are before any cell merging is done. In the grid layout there are n rows and m columns, and so m * n cells; each row-column intersection has a cell.

When you address a grid cell that is "merged" into some other cell, then the top-left member of the merged (rectangular) region is returned.

This means that some content is returned more than once if the table includes merged cells.
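
One way to cope with the repeated content is to remember which underlying cell elements have already been seen and to treat any later grid position that maps to the same element as empty. The following is only a sketch under some assumptions: it relies on the private `_tc` attribute of a python-docx cell (the underlying `<w:tc>` XML element), which is not part of the public API and may change between versions, and the names `unique_cell_texts` and `first_table_texts` are made up for this illustration.

```python
def unique_cell_texts(table):
    """Return the table text row by row, with an empty string at every
    grid position that merely repeats an already-seen merged cell."""
    seen = set()  # underlying cell elements encountered so far
    result = []
    for row in table.rows:
        row_texts = []
        for cell in row.cells:
            tc = cell._tc  # ASSUMPTION: private attribute, the <w:tc> element
            if tc in seen:
                # Same element as an earlier grid position: this position
                # belongs to a merged region, so report it as empty.
                row_texts.append("")
            else:
                seen.add(tc)
                row_texts.append(cell.text)
        result.append(row_texts)
    return result


def first_table_texts(path):
    """Hypothetical helper: load a .docx file and process its first table."""
    from docx import Document  # python-docx, imported only when needed

    document = Document(path)
    return unique_cell_texts(document.tables[0])
```

With this approach the text of a merged region is reported once, at the first grid position where it is encountered, which is the top-left position when the table is walked row by row.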
