Satyaaditya

Reputation: 595

Understanding bookmarks in docx file

I am trying to extract bookmarks from a Docx file using python-docx. The code I wrote extracts bookmarks from some Docx files, but it cannot find any bookmarks in others.

My approach is to find the w:bookmarkStart tags, go to each tag's parent, and retrieve all the runs in that paragraph. However, some documents contain neither w:bookmarkStart nor hyperlink tags, yet Docx viewers are still able to identify bookmarks in them.

Here is the XML content of a paragraph that a Docx viewer shows as a bookmark but that doesn't contain any bookmark or hyperlink tags.

Note: The code I mentioned is working for Docx files created using Google Docs.

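    from docx import Document
    from docx.oxml.ns import qn  # qn() is defined in docx.oxml.ns

    def get_toc(self):
        # Root XML element of the main document part
        doc_element = self.document.part._element
        # Every w:bookmarkStart element anywhere in the document
        bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
        for bookmark in bookmarks_list:
            # The parent is usually the enclosing w:p
            par = bookmark.getparent()
            runs = par.findall(qn('w:r'))
            for run in runs:
                text_element = run.find(qn('w:t'))
                # Skip runs without text (e.g. tabs or breaks) instead of
                # swallowing every exception with a bare except
                if text_element is not None and text_element.text:
                    print(' ', text_element.text, end=' ')
            print('\n', '-' * 50)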

Am I missing something or do I need to find some other tags?

If not, how can I identify bookmarks in such scenarios?

Upvotes: 2

Views: 4342

Answers (1)

Thomas Barnekow

Reputation: 2289

In Open XML documents, a bookmark is defined by a matched pair of one w:bookmarkStart and one w:bookmarkEnd element, where each one has a w:id attribute with the same value.

Here is an example paragraph whose bookmark contains only the text "second", not the full text of the paragraph ("First, second, and third.").

    <w:p>
      <w:r>
        <w:t xml:space="preserve">First, </w:t>
      </w:r>
      <w:bookmarkStart w:id="1" w:name="MyBookmarkName" />
      <w:r>
        <w:t>second</w:t>
      </w:r>
      <w:bookmarkEnd w:id="1" />
      <w:r>
        <w:t>, and third.</w:t>
      </w:r>
    </w:p>

This means that:

  • there is no bookmark without those w:bookmarkStart and w:bookmarkEnd elements (so the paragraph you linked does not contain a bookmark) and
  • retrieving the full text of the w:p just because you found a w:bookmarkStart element is not correct.
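To check whether a given piece of XML contains a bookmark at all, one option is to pair the start and end markers by their w:id. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree rather than python-docx so that it runs on its own, and list_bookmarks and SAMPLE are names made up for this illustration. The same pairing logic should carry over to the element tree that python-docx gives you.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}local" notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_bookmarks(root):
    """Return {w:id: w:name} for each w:bookmarkStart with a matching w:bookmarkEnd.

    Hypothetical helper for illustration; not part of python-docx.
    """
    starts = {
        el.get(W + "id"): el.get(W + "name")
        for el in root.iter(W + "bookmarkStart")
    }
    end_ids = {el.get(W + "id") for el in root.iter(W + "bookmarkEnd")}
    # Keep only the start markers whose id also appears on an end marker
    return {bid: name for bid, name in starts.items() if bid in end_ids}


# The example paragraph from this answer, with the namespace declared
SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">First, </w:t></w:r>
  <w:bookmarkStart w:id="1" w:name="MyBookmarkName" />
  <w:r><w:t>second</w:t></w:r>
  <w:bookmarkEnd w:id="1" />
  <w:r><w:t>, and third.</w:t></w:r>
</w:p>"""

root = ET.fromstring(SAMPLE)
print(list_bookmarks(root))  # {'1': 'MyBookmarkName'}
```

If this returns an empty dict for a paragraph, that paragraph contains no complete bookmark, whatever a viewer may display for it.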

And there are more things to note:

  • A bookmark can span multiple paragraphs, leaving out one or more leading runs of the w:p containing the w:bookmarkStart and one or more trailing runs of the w:p containing the w:bookmarkEnd.
  • Both w:bookmarkStart and w:bookmarkEnd can even appear outside of w:p elements, e.g., as child elements of w:body.
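Because of these cases, the text of a bookmark is better collected by walking the document in order from the start marker to the matching end marker than by looking at a single parent paragraph. Here is a rough sketch of that idea, again using only xml.etree.ElementTree so it is self-contained. bookmark_text, SAMPLE_BODY and the bookmark name "Span" are invented for this example, and the sketch only gathers w:t text (tabs, breaks and other run content are ignored).

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}local" notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def bookmark_text(root, name):
    """Return the w:t text between the named bookmark's start and end markers.

    Hypothetical helper for illustration. Returns None if the bookmark is
    missing or has no matching w:bookmarkEnd.
    """
    bookmark_id = None  # set once the start marker has been seen
    pieces = []
    # iter() yields elements in document order, so the walk does not care
    # whether the markers sit inside a w:p or directly under w:body
    for el in root.iter():
        if el.tag == W + "bookmarkStart" and el.get(W + "name") == name:
            bookmark_id = el.get(W + "id")
        elif bookmark_id is not None:
            if el.tag == W + "bookmarkEnd" and el.get(W + "id") == bookmark_id:
                return "".join(pieces)
            if el.tag == W + "t" and el.text:
                pieces.append(el.text)
            elif el.tag == W + "p" and pieces:
                # A new paragraph starts inside the bookmark
                pieces.append("\n")
    return None


# Made-up body: the bookmark starts mid-way through the first paragraph
# and ends mid-way through the second
SAMPLE_BODY = """<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:r><w:t xml:space="preserve">Intro </w:t></w:r>
    <w:bookmarkStart w:id="7" w:name="Span" />
    <w:r><w:t>tail of first</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t>head of second</w:t></w:r>
    <w:bookmarkEnd w:id="7" />
    <w:r><w:t xml:space="preserve"> outro</w:t></w:r>
  </w:p>
</w:body>"""

body = ET.fromstring(SAMPLE_BODY)
print(repr(bookmark_text(body, "Span")))
```

For this sample, the leading "Intro " and the trailing " outro" are left out, and the two bookmarked fragments are joined with a newline where the paragraph boundary falls.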

Upvotes: 2
