Reputation: 595
I am trying to extract bookmarks from a Docx file using python-docx. My code extracts the bookmarks from some Docx files but finds none in others.
I find the w:bookmarkStart tags, go to each one's parent tag, and retrieve all the runs in that paragraph. But some documents have neither a w:bookmarkStart nor a hyperlink tag, yet Docx viewers are still able to identify the bookmarks.
Here is the XML content of a paragraph that a Docx viewer shows as a bookmark but that doesn't contain any bookmark or hyperlink tags.
Note: The code below works for Docx files created using Google Docs.
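from docx.oxml.shared import qn
from docx import Document

def get_toc(self):
    # Root element of the main document part (word/document.xml)
    doc_element = self.document.part._element
    bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
    for bookmark in bookmarks_list:
        # Parent of the bookmark start (usually, but not always, a w:p)
        par = bookmark.getparent()
        runs = par.findall(qn('w:r'))
        for run in runs:
            try:
                print(' ', run.find(qn('w:t')).text, end=' ')
            except AttributeError:
                # The run has no w:t child, so find() returned None
                pass
        print('\n', '-'*50)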
Am I missing something or do I need to find some other tags?
If not, how can I identify bookmarks in such scenarios?
Upvotes: 2
Views: 4342
Reputation: 2289
In Open XML documents, a bookmark is defined by a matched pair of one w:bookmarkStart and one w:bookmarkEnd element, where each one has a w:id attribute with the same value.
Here is an example paragraph with a bookmark that just contains the text "second" and not the full text of the paragraph (e.g., "First, second, and third").
<w:p>
  <w:r>
    <w:t xml:space="preserve">First, </w:t>
  </w:r>
  <w:bookmarkStart w:id="1" w:name="MyBookmarkName" />
  <w:r>
    <w:t>second</w:t>
  </w:r>
  <w:bookmarkEnd w:id="1" />
  <w:r>
    <w:t>, and third.</w:t>
  </w:r>
</w:p>
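As a rough illustration of matching the pair in the example above, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than python-docx. The helper names `w` and `find_bookmark_pairs` are made up for this example, and the sample paragraph has the namespace declaration added so it can be parsed on its own.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Return the namespace-qualified name for a w: element or attribute."""
    return f"{{{W_NS}}}{tag}"


def find_bookmark_pairs(root):
    """Map each bookmark name to its (start, end) elements, matched on w:id.

    A w:bookmarkStart without a matching w:bookmarkEnd is skipped.
    """
    ends = {el.get(w("id")): el for el in root.iter(w("bookmarkEnd"))}
    pairs = {}
    for start in root.iter(w("bookmarkStart")):
        end = ends.get(start.get(w("id")))
        if end is not None:
            pairs[start.get(w("name"))] = (start, end)
    return pairs


# The example paragraph, with the namespace declared so it parses standalone
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">First, </w:t></w:r>
  <w:bookmarkStart w:id="1" w:name="MyBookmarkName" />
  <w:r><w:t>second</w:t></w:r>
  <w:bookmarkEnd w:id="1" />
  <w:r><w:t>, and third.</w:t></w:r>
</w:p>
"""

pairs = find_bookmark_pairs(ET.fromstring(SAMPLE.strip()))
print(list(pairs))  # ['MyBookmarkName']
```

For a real file, the root could be obtained with zipfile.ZipFile(path).read("word/document.xml") passed to ET.fromstring.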
This means that:
- you need to look for matched pairs of w:bookmarkStart and w:bookmarkEnd elements (so the paragraph you linked does not contain a bookmark) and
- retrieving all runs of the parent w:p just because you found a w:bookmarkStart element is not correct.

And there are more things to note:
- A bookmark can start in one paragraph and end in another. In that case it covers the trailing runs of the w:p containing the w:bookmarkStart and the leading runs of the w:p containing the w:bookmarkEnd.
- w:bookmarkStart and w:bookmarkEnd can even appear outside of w:p elements, e.g., as child elements of w:body.
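One way to cope with bookmarks that span paragraphs or sit directly under w:body is to walk the whole tree in document order and collect the text that falls between each start and its matching end. The following is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree, the names `w` and `bookmark_texts` are hypothetical, the sample XML is invented for the demonstration, and only w:t text is collected (tabs, breaks, and field codes are ignored).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Return the namespace-qualified name for a w: element or attribute."""
    return f"{{{W_NS}}}{tag}"


def bookmark_texts(root):
    """Return {bookmark name: text between its start and end markers}."""
    # Only open bookmarks whose w:id also appears on some w:bookmarkEnd
    end_ids = {el.get(w("id")) for el in root.iter(w("bookmarkEnd"))}
    open_names = {}  # w:id -> w:name of bookmarks currently open
    texts = {}       # w:name -> list of collected text fragments
    for el in root.iter():  # pre-order traversal, i.e. document order
        if el.tag == w("bookmarkStart"):
            if el.get(w("id")) in end_ids:
                open_names[el.get(w("id"))] = el.get(w("name"))
                texts.setdefault(el.get(w("name")), [])
        elif el.tag == w("bookmarkEnd"):
            open_names.pop(el.get(w("id")), None)
        elif el.tag == w("t") and el.text:
            for name in open_names.values():
                texts[name].append(el.text)
        elif el.tag == w("p"):
            # A new paragraph begins inside an open bookmark: mark the break
            for name in open_names.values():
                if texts[name]:
                    texts[name].append("\n")
    return {name: "".join(parts) for name, parts in texts.items()}


# Invented sample: an inline bookmark, one spanning two paragraphs,
# and one whose markers are direct children of w:body
SAMPLE = f"""
<w:body xmlns:w="{W_NS}">
  <w:p>
    <w:r><w:t xml:space="preserve">First, </w:t></w:r>
    <w:bookmarkStart w:id="1" w:name="Inline" />
    <w:r><w:t>second</w:t></w:r>
    <w:bookmarkEnd w:id="1" />
    <w:r><w:t xml:space="preserve">, and </w:t></w:r>
    <w:bookmarkStart w:id="2" w:name="Spanning" />
    <w:r><w:t>third.</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>Fourth</w:t></w:r>
    <w:bookmarkEnd w:id="2" />
    <w:r><w:t xml:space="preserve"> and fifth.</w:t></w:r>
  </w:p>
  <w:bookmarkStart w:id="3" w:name="BodyLevel" />
  <w:p><w:r><w:t>Sixth.</w:t></w:r></w:p>
  <w:bookmarkEnd w:id="3" />
</w:body>
"""

result = bookmark_texts(ET.fromstring(SAMPLE.strip()))
print(result)
# {'Inline': 'second', 'Spanning': 'third.\nFourth', 'BodyLevel': 'Sixth.'}
```

Because lxml implements the same iter() and get() methods, the same walk should also apply to the element tree that python-docx exposes, though that is untested here.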
Upvotes: 2