Understanding bookmarks in docx file

Question

I am trying to extract bookmarks from a docx file using python-docx. The code I wrote finds bookmarks in some docx files but finds none in others.

My approach is to find each w:bookmarkStart tag, go to its parent element, and retrieve all the runs in that paragraph. However, some documents contain neither w:bookmarkStart nor hyperlink tags, yet docx viewers still identify bookmarks in them.

Here is the XML content of a paragraph that a docx viewer shows as a bookmark but that doesn't contain any bookmark or hyperlink tags.

Note: The code below works for docx files created using Google Docs.

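    from docx import Document
    from docx.oxml.ns import qn  # qn() expands 'w:tag' to its namespaced form

    def get_toc(self):
        # Root element of the main document part
        doc_element = self.document.part._element
        bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
        for bookmark in bookmarks_list:
            # Element that directly contains the bookmark start (usually a paragraph)
            par = bookmark.getparent()
            runs = par.findall(qn('w:r'))
            for run in runs:
                text_el = run.find(qn('w:t'))
                # Skip runs that have no text node instead of swallowing every exception
                if text_el is not None:
                    print(' ', text_el.text, end=' ')
            print('\n', '-' * 50)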

Am I missing something or do I need to find some other tags?

If not, how can I identify bookmarks in such scenarios?
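
One thing that may be worth checking, stated here as an assumption and not as a confirmed cause: some viewers might build their "bookmarks" panel from heading styles or outline levels and not from w:bookmarkStart tags. The sketch below uses only the standard library to list paragraphs that carry a paragraph style whose ID starts with "Heading" or a w:outlineLvl element. The helper names (`read_document_xml`, `find_heading_like_paragraphs`) are hypothetical, and the "Heading" prefix is a guess, since style IDs can differ between documents.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def w(tag):
    # Build the fully qualified name for a WordprocessingML tag or attribute
    return '{%s}%s' % (W_NS, tag)


def read_document_xml(path):
    # A docx file is a zip archive; the main body lives in word/document.xml
    with zipfile.ZipFile(path) as zf:
        return zf.read('word/document.xml')


def find_heading_like_paragraphs(document_xml):
    """Return (style_id, outline_level, text) for heading-like paragraphs."""
    root = ET.fromstring(document_xml)
    results = []
    for par in root.iter(w('p')):
        ppr = par.find(w('pPr'))
        if ppr is None:
            continue
        style = ppr.find(w('pStyle'))
        outline = ppr.find(w('outlineLvl'))
        style_id = style.get(w('val')) if style is not None else None
        level = outline.get(w('val')) if outline is not None else None
        # Assumption: heading style IDs start with "Heading" (not guaranteed)
        is_heading_style = style_id is not None and style_id.lower().startswith('heading')
        if not is_heading_style and level is None:
            continue
        # Join the text of every w:t node inside the paragraph
        text = ''.join(t.text or '' for t in par.iter(w('t')))
        results.append((style_id, level, text))
    return results
```

For a file on disk, something like `find_heading_like_paragraphs(read_document_xml('example.docx'))` would print nothing by itself but return the list to inspect. If the paragraphs the viewer shows as bookmarks turn up in that list, the viewer is probably not relying on bookmark tags for them.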
