Kat hughes

Reputation: 55

Parsing docx files in Python

I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:

A. Cat

B. Dog

C. Pig

D. Fox

If a file has more than 26 headings, the later ones are prefixed with ‘AA.’, ‘BB.’, etc.

I have the following code, which mostly works, except that any heading prefixed with ‘D.’ is printed twice, e.g. [Cat, Dog, Pig, Fox, Fox]:

import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)

    print(head1s)

Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.

Upvotes: 3

Views: 10843

Answers (2)

user1890239

Reputation: 79

You can also use style.name from the same library, provided the headings have a heading style applied:

import docx

def find_headings(doc_path):
    # Collect the text of every paragraph styled as 'Heading 1'.
    doc = docx.Document(doc_path)
    headings = []
    for para in doc.paragraphs:
        if para.style.name == 'Heading 1':
            headings.append(para.text)
    return headings
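Since this only helps when the file actually uses heading styles, it may be worth checking first which style names occur. A minimal sketch, assuming python-docx style paragraph objects that expose .style.name; count_style_names is a name invented here for illustration:

```python
from collections import Counter

def count_style_names(paragraphs):
    """Count how often each paragraph style name occurs."""
    return Counter(paragraph.style.name for paragraph in paragraphs)

# Usage, assuming python-docx is installed and doc_path points to a .docx file:
# import docx
# print(count_style_names(docx.Document(doc_path).paragraphs))
```

If every paragraph comes back as 'Normal', as described in the question, the style check will find nothing and a text-based test is needed instead.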

Upvotes: 0

glycoaddict

Reputation: 929

What's happening is that the append sits inside the loop over runs, so for D. Fox the loop keeps going after the first bold run and, because heading still matches, adds the same text to head1s a second time.

I think the for run in paragraph.runs: loop is executing twice for that paragraph. Word sometimes splits text that looks uniform into several runs, so there may be a second bold "run" that is invisible on the page.
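One way to check for a hidden second run is to print the runs of the affected paragraph. This is a diagnostic sketch only: describe_runs is a made-up helper name, and it relies on nothing beyond the .runs, .text and .bold attributes that python-docx paragraphs and runs provide.

```python
def describe_runs(paragraph):
    """Return (text, bold) pairs for each run in a python-docx style paragraph."""
    return [(run.text, run.bold) for run in paragraph.runs]

# Usage, assuming `document` was loaded as in the question:
# for paragraph in document.paragraphs:
#     if paragraph.text.startswith('D.'):
#         print(describe_runs(paragraph))
```

If this shows two entries with bold set to True for a single heading, the heading is being appended once per run.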

Adding a break once the first match is found should be enough to stop the second run from triggering another append:

for file in os.listdir(directory):

    document = Document(directory+file)

    head1s = []

    for paragraph in document.paragraphs:

        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)

        for run in paragraph.runs:

            if run.bold:

                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
                    # this break stops the run loop if a match was found.
                    break

    print(head1s)
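A variant that does not depend on break is to decide once per paragraph, so the number of runs no longer matters. The following is a minimal sketch rather than a tested drop-in: extract_headings and HEADING_RE are names invented here, and it assumes a paragraph counts as a heading when it matches the pattern and contains at least one bold run. It only needs objects with .text and .runs (each run having .bold), which is what document.paragraphs provides.

```python
import re

# One or more capitals, a full stop, whitespace, then the heading text.
HEADING_RE = re.compile(r'^[A-Z]+\.\s+(.*)')

def extract_headings(paragraphs):
    """Return heading texts from python-docx style paragraph objects."""
    headings = []
    for paragraph in paragraphs:
        match = HEADING_RE.match(paragraph.text)
        # any() inspects the runs but appends at most once per paragraph.
        if match and any(run.bold for run in paragraph.runs):
            headings.append(match.group(1).strip())
    return headings
```

With a real file this would be called as extract_headings(Document(path).paragraphs). Capturing the text with the regex group also avoids the leading space that split('.')[1] leaves behind.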

Upvotes: 2
