Reputation: 702
I want to parse the structure of a docx file and its content using python-docx. The file is structured using 'Heading 1' to 'Heading 6'. Under any heading, content can appear in the form of a table element.
I understand how to extract the headings and the tables independently of each other using python-docx:
from docx import Document

doc = Document("file.docx")
for paragraph in doc.paragraphs:
    if paragraph.style == doc.styles['Heading 1']:
        indent = 1
        result.append('- %s' % paragraph.text.strip())
    elif paragraph.style == doc.styles['Heading 2']:
        indent = 2
        result.append(' ' * indent + '- %s:' % paragraph.text.strip())
    elif paragraph.style == doc.styles['Heading 3']:
        indent = 3
        result.append(' ' * indent + '- %s:' % paragraph.text.strip())
    [...]
    else:
        [...]
for table in doc.tables:
    if _is_content(table.row_cells(0)[0].text):
        result.add_table(table)
My problem is preserving the structure. How do I find out under which heading a table sits in the source document?
Upvotes: 0
Views: 1368
Reputation: 718
You can extract the structural information from the docx file by walking its underlying XML. Try this:
from docx import Document
from docx.text.paragraph import Paragraph

doc = Document("file.docx")
headings = []  # heading texts, extracted with your own code
tables = []  # tables, extracted with your own code
tags = []
all_text = []
schema = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
# Walk the direct children of <w:body> in document order.
for child in doc.element.body.iterchildren():
    if child.tag == schema + 'p':
        # Wrap the raw <w:p> element so its runs are joined into one string.
        node_text = Paragraph(child, doc).text
        if node_text:
            if node_text in headings:
                tags.append('heading')
            else:
                tags.append('text')
            all_text.append(node_text)
    elif child.tag == schema + 'tbl':
        tags.append('table')
After running the code above, the tags list shows the structure of the document in order (heading, text, and table entries), and you can then map the corresponding data from the other lists onto it.
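The mapping step could look something like the sketch below. It is only an illustration, not part of python-docx: the function name group_tables_by_heading is made up, and it assumes that tags is in document order, that all_text has exactly one entry per 'heading' or 'text' tag, and that tables has exactly one entry per 'table' tag (for example list(doc.tables), unfiltered). The strings in the sample data are stand-ins for real table objects.

```python
def group_tables_by_heading(tags, all_text, tables):
    """Attach each table to the most recent heading seen before it.

    Hypothetical helper: assumes `tags` is in document order, `all_text`
    holds one entry per 'heading'/'text' tag and `tables` holds one entry
    per 'table' tag.
    """
    texts = iter(all_text)
    remaining_tables = iter(tables)
    current_heading = None  # None means "before the first heading"
    grouped = {}
    for tag in tags:
        if tag == 'table':
            # Consume the next table and file it under the current heading.
            grouped.setdefault(current_heading, []).append(next(remaining_tables))
        else:
            # Consume the matching text entry to keep both lists in step.
            text = next(texts)
            if tag == 'heading':
                current_heading = text
                grouped.setdefault(current_heading, [])
    return grouped


# Stand-in data; with a real document, `tables` would hold Table objects.
tags = ['heading', 'text', 'table', 'heading', 'table', 'table']
all_text = ['Introduction', 'Some paragraph', 'Results']
tables = ['table-1', 'table-2', 'table-3']
grouped = group_tables_by_heading(tags, all_text, tables)
print(grouped)
```

Note that this keys the result by heading text, so two headings with identical text would be merged, and the heading level is not tracked. If that matters, storing (index, text) pairs or a level alongside each heading would be one way around it.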
Upvotes: 1