Bruno Ranieri

Reputation: 702

Parse Docx file content w.r.t. headings

I want to parse the structure of a docx file and its content using python-docx. The file is structured using 'Heading 1' to 'Heading 6'. Under any heading, the content may take the form of a table element.

I know how to extract the headings and the tables independently of each other using python-docx:

    from docx import Document

    doc = Document("file.docx")
    for paragraph in doc.paragraphs:
        if paragraph.style == doc.styles['Heading 1']:
            indent = 1
            result.append('- %s' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 2']:
            indent = 2
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        elif paragraph.style == doc.styles['Heading 3']:
            indent = 3
            result.append('  ' * indent + '- %s:' % paragraph.text.strip())
        [...]
        else:
            [...]

    for table in doc.tables:
        if _is_content(table.row_cells(0)[0].text):
            result.add_table(table)

My problem is preserving the structure. How do I find out under which heading a table sits in the source document?

Upvotes: 0

Views: 1368

Answers (1)

abdulsaboor

Reputation: 718

You can recover the structure from the XML underlying the docx file, where paragraphs and tables appear as children of the body element in document order. Try this:

    from docx import Document
    from docx.text.paragraph import Paragraph

    doc = Document("file.docx")
    headings = []  # fill with the heading texts extracted by your own code
    tables = []    # fill with the tables extracted by your own code
    tags = []
    all_text = []
    schema = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    # Walk the direct children of <w:body> in document order
    for child in doc.element.body.iterchildren():
        if child.tag == schema + 'tbl':
            tags.append('table')
        elif child.tag == schema + 'p':
            # Wrap the raw element so the text of all its runs is joined
            node_text = Paragraph(child, doc).text.strip()
            if node_text:
                if node_text in headings:
                    tags.append('heading')
                else:
                    tags.append('text')
                all_text.append(node_text)

After running the code above, the tags list describes the structure of the document in order: each entry is heading, text, or table. You can then map the corresponding data from the headings, all_text and tables lists onto it.
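
To make the mapping step concrete, here is a minimal sketch that attaches each table directly to the heading it follows instead of keeping parallel lists. The names iter_blocks and group_by_heading are hypothetical helpers, not part of python-docx. The sketch assumes that headings use the built-in style names 'Heading 1' to 'Heading 6' and that all tables sit directly in the document body, not nested inside other tables.

```python
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def iter_blocks(doc):
    """Yield (kind, payload) tuples for the body of a python-docx Document,
    in document order. Hypothetical helper, not a python-docx API."""
    # Imported lazily so group_by_heading() stays usable without python-docx.
    from docx.table import Table
    from docx.text.paragraph import Paragraph

    for child in doc.element.body.iterchildren():
        if child.tag == W_NS + 'tbl':
            yield 'table', Table(child, doc)
        elif child.tag == W_NS + 'p':
            para = Paragraph(child, doc)
            style = (para.style.name or '') if para.style is not None else ''
            text = para.text.strip()
            # Assumption: headings are named 'Heading <level>'.
            if style.startswith('Heading ') and style[8:].isdigit():
                yield 'heading', (int(style[8:]), text)
            elif text:
                yield 'text', text


def group_by_heading(blocks):
    """Group blocks under the most recent heading.

    Content that appears before the first heading lands in a leading
    section whose title is None and whose level is 0.
    """
    sections = [{'level': 0, 'title': None, 'content': []}]
    for kind, payload in blocks:
        if kind == 'heading':
            level, title = payload
            sections.append({'level': level, 'title': title, 'content': []})
        else:
            sections[-1]['content'].append((kind, payload))
    return sections
```

Calling group_by_heading(iter_blocks(Document("file.docx"))) returns one dict per heading, so every table ends up in the content list of the heading it belongs to. The level field can then drive the indentation used in the question.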

Upvotes: 1
