Reputation: 344
A little background
I have a software specification from which I need to parse requirements that appear in tables. The tables are not always in the same format. I inherited a Python script that uses win32com to parse the Word document and openpyxl to export the requirements to an Excel file, which is then uploaded to HP ALM.
Question
Using Python (or some other language that can communicate with Python), I am looking for a relatively simple way to differentiate between merged cells and empty cells, both of which occur in my Microsoft Word 2010 (.docx) documents.
Explanation
I have been searching for a solution for a couple of weeks now, but I haven't found a satisfactory answer yet.
There are questions here and here that I've looked at on Stack Overflow. The second question says there is a field that tells you whether a table contains merged cells, which is a starting point but not sufficient, since the table may be one very long table spanning many pages.
Attempts at a solution
Attempt 1.) My first thought was that surely win32com supports detecting merged cells in a table, so I searched for methods that would do this for me. The only approach I found is to check whether a cell is blank while the previous one isn't. But then I can't tell whether the cell is truly blank or merged.
Attempt 2.) My next thought was to add the feature to win32com using COM and the win32 API. But I found COM quite unwieldy, out of date, poorly documented, and difficult to use. The same goes for the win32 API. I concluded this is more effort than it's worth.
Attempt 3.) Then I began looking for alternative libraries to win32com, such as docx for Python. The issue here is that I work on a non-administrator computer, which severely restricts my ability to download third-party libraries. I have yet to try this option, because I already went down this road when getting win32com and openpyxl.
Attempt 4.) My latest and possibly final attempt was to turn the Word .docx document into an XML file that I can parse easily. However, I don't know XML, nor do I know the format Word uses for its XML.
And here I am, looking for the quickest, cleanest way to do this without rewriting libraries or starting my 1000-line script over from scratch (it is that long because it has a GUI laid over the top of it).
Upvotes: 1
Views: 3011
Reputation: 1279
If you decide to use the docx module from python-docx (which would be my recommendation), merged cells are the same object in memory, so if you have a row of 3 cells and the first 2 are merged, row.cells[0] == row.cells[1] is True. Given that, I made two simple functions to return the indices of merged cells.
import docx


def get_indicies_of_uniques(items):
    # Map each distinct item to the list of positions where it occurs.
    unique_indicies = {}
    for index, item in enumerate(items):
        if item not in unique_indicies:
            unique_indicies[item] = []
        unique_indicies[item].append(index)
    return unique_indicies


def get_merged_indicies(row):
    # A merged cell appears more than once in row.cells, so keep only
    # the index groups that contain more than one position.
    unique_indicies = get_indicies_of_uniques(row.cells)
    return [indicies for indicies in list(unique_indicies.values())
            if len(indicies) > 1]
For the case where the row contains 3 cells, the first 2 being merged, the following is the result:
get_merged_indicies(row)
# returns [[0, 1]]
If you have 2 merged cells, 1 unmerged cell, then 2 more merged cells (total 5 cells in the row):
get_merged_indicies(row_with_5_cells)
# returns [[0, 1], [3, 4]]
I'm not sure what format you need the results of such a function in, but this may get you started in the right direction.
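If installing python-docx is not possible on a locked-down machine, the same merge information can be read with the standard library alone. A .docx file is a zip archive, and the main body normally lives in word/document.xml. In that XML a horizontally merged cell carries a w:gridSpan element, and vertically merged cells carry w:vMerge (with w:val="restart" on the first cell and no value on the continuation cells). A truly empty cell has neither. The following is only a rough sketch under those assumptions: the function names, the returned dictionary keys, and SAMPLE_XML are made up for illustration, and it does not handle nested tables or rows wrapped in content controls.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _q(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return "{%s}%s" % (W, tag)


def read_document_xml(path):
    """Read the main document part from a .docx (assumed at word/document.xml)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml")


def classify_cells(document_xml):
    """Return tables -> rows -> cells, each cell a dict describing its merge state."""
    root = ET.fromstring(document_xml)
    tables = []
    for tbl in root.iter(_q("tbl")):
        rows = []
        for tr in tbl.findall("w:tr", NS):
            cells = []
            for tc in tr.findall("w:tc", NS):
                span = tc.find("w:tcPr/w:gridSpan", NS)
                vmerge = tc.find("w:tcPr/w:vMerge", NS)
                cells.append({
                    # All text runs inside the cell, concatenated.
                    "text": "".join(t.text or "" for t in tc.iter(_q("t"))),
                    # Number of grid columns covered (1 means not merged sideways).
                    "grid_span": int(span.get(_q("val"), "1")) if span is not None else 1,
                    # None, "restart" (top of a vertical merge) or "continue".
                    "v_merge": None if vmerge is None else vmerge.get(_q("val"), "continue"),
                })
            rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical 3-column table: row 1 has a 2-wide cell and the top of a
# vertical merge; row 2 has a truly empty cell and a merge continuation.
SAMPLE_XML = (
    '<w:document xmlns:w="%s"><w:body><w:tbl>'
    '<w:tr>'
    '<w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>'
    '</w:tr>'
    '<w:tr>'
    '<w:tc><w:p/></w:tc>'
    '<w:tc><w:p><w:r><w:t>C</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>'
    '</w:tr>'
    '</w:tbl></w:body></w:document>' % W
)

if __name__ == "__main__":
    for row in classify_cells(SAMPLE_XML)[0]:
        print(row)
```

With a real file you would call classify_cells(read_document_xml("spec.docx")), where the path is a placeholder. A cell with empty text and v_merge equal to "continue" is part of a vertical merge, while a cell with empty text, grid_span 1 and v_merge None is simply blank.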
Upvotes: 0
Reputation: 2851
According to the documentation, merged cells in Word become one cell after being merged (unlike Excel), so the concept of merged cells does not really exist in Word. The only way to detect them is to analyse all the tables with the algorithm you found in the posts linked in your question, which consists of finding missing cells: cells that do not exist because another cell is taking their place as the result of a merge.
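The "missing cell" idea can be sketched independently of COM. The helper below is hypothetical (find_missing_cells is not part of win32com or Word); it just compares the full row-by-column grid against the coordinates of the cells that actually exist. The commented lines show one way the inputs might be gathered from a win32com Table object, assuming Cell.RowIndex, Cell.ColumnIndex, Rows.Count and Columns.Count behave as documented in the Word object model.

```python
def find_missing_cells(n_rows, n_cols, present):
    """Return the 1-based (row, column) grid positions that have no cell.

    n_rows, n_cols -- size of the table grid
    present        -- iterable of (row, column) pairs for cells that exist
    """
    present = set(present)
    return [
        (row, col)
        for row in range(1, n_rows + 1)
        for col in range(1, n_cols + 1)
        if (row, col) not in present
    ]


# Possible use with win32com (untested assumption, shown as comments only):
#   present = [(c.RowIndex, c.ColumnIndex) for c in table.Range.Cells]
#   missing = find_missing_cells(table.Rows.Count, table.Columns.Count, present)

# Stand-alone illustration: a 2 x 3 grid where position (2, 2) has no cell.
example_present = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3)]
example_missing = find_missing_cells(2, 3, example_present)
```

A cell that exists but has no text is therefore empty, while a position reported as missing was absorbed by a merge. Note that a gap only tells you a row or column is short of the full grid; which neighbour absorbed the position may still need to be inferred from the surrounding cells.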
Upvotes: 1