Reputation: 55
I’m trying to read headings from multiple docx files. Annoyingly, these headings do not have an identifiable paragraph style. All paragraphs have ‘normal’ paragraph styling so I am using regex. The headings are formatted in bold and are structured as follows:
A. Cat
B. Dog
C. Pig
D. Fox
If there are more than 26 headings in a file, the headings are preceded by ‘AA.’, ‘BB.’, etc.
I have the following code, which kind of works except any heading preceded by ‘D.’ prints twice, e.g. [Cat, Dog, Pig, Fox, Fox]
import os
from docx import Document
import re

directory = input("Copy and paste the location of the files.\n").lower()
for file in os.listdir(directory):
    document = Document(directory+file)
    head1s = []
    for paragraph in document.paragraphs:
        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
        for run in paragraph.runs:
            if run.bold:
                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
    print(head1s)
Can anyone tell me if there is something wrong with the code that is causing this to happen? As far as I can tell, there is nothing unique about the formatting or structure of these particular headings in the Word files.
Upvotes: 3
Views: 10843
Reputation: 79
You can also use style.name from the same library, provided the headings carry a heading style such as 'Heading 1':
import docx

def find_headings(doc_path):
    # Collect the text of every paragraph styled as 'Heading 1'
    doc = docx.Document(doc_path)
    headings = []
    for para in doc.paragraphs:
        if para.style.name == 'Heading 1':
            headings.append(para.text)
    return headings
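Since the question says every paragraph is styled 'Normal', it may be worth checking which style names a file actually contains before relying on this. A minimal sketch, assuming python-docx is installed; style_counts and style_counts_for_file are hypothetical helper names, not part of the library:

```python
from collections import Counter

def style_counts(paragraphs):
    """Tally paragraphs by the name of their paragraph style."""
    return Counter(p.style.name for p in paragraphs)

def style_counts_for_file(doc_path):
    # python-docx is imported here so style_counts itself works without it
    from docx import Document
    return style_counts(Document(doc_path).paragraphs)
```

If the result only ever contains 'Normal', the style-based approach will find nothing and a text-based check is needed instead.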
Upvotes: 0
Reputation: 929
What's happening is that head1s.append(head1) sits inside the for run in paragraph.runs: loop, so the heading is appended once for every bold run in a matching paragraph.
I think the 'D. Fox' paragraph is made up of two bold runs, even though it looks like one piece of text. Word often splits text into several runs invisibly, for example after an edit or a spell-check mark.
Adding a break when the first bold run is found should be enough to prevent the second run from triggering another append:
for file in os.listdir(directory):
    document = Document(directory+file)
    head1s = []
    for paragraph in document.paragraphs:
        heading = re.match(r'^[A-Z]+[.]\s', paragraph.text)
        for run in paragraph.runs:
            if run.bold:
                if heading:
                    head1 = paragraph.text
                    head1 = head1.split('.')[1]
                    head1s.append(head1)
                    # this break stops the run loop if a match was found.
                    break
    print(head1s)
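An alternative to the break is to decide once per paragraph rather than once per run. The sketch below is one possible way to do that, not the only one: heading_text, read_headings, HEADING_RE and fake_paragraph are hypothetical names, and the stand-in paragraphs only mimic the text, runs and bold attributes used above. It also takes the title from a regex group instead of split('.'), so a title that contains a full stop is not cut short.

```python
import re
from types import SimpleNamespace

# One or more capital letters, a full stop, whitespace, then the title
HEADING_RE = re.compile(r'^([A-Z]+)\.\s+(.*)')

def heading_text(paragraph):
    """Return the title of a bold 'A. Title' paragraph, or None."""
    match = HEADING_RE.match(paragraph.text)
    if match is None:
        return None
    # run.bold is True, False or None (inherited); any() treats None as not bold
    if not any(run.bold for run in paragraph.runs):
        return None
    return match.group(2).strip()

def read_headings(path):
    # python-docx is imported here so heading_text can be tried without it
    from docx import Document
    found = (heading_text(p) for p in Document(path).paragraphs)
    return [title for title in found if title is not None]

def fake_paragraph(*runs):
    # Stand-in for a python-docx paragraph; runs are (text, bold) pairs
    run_objs = [SimpleNamespace(text=t, bold=b) for t, b in runs]
    return SimpleNamespace(text="".join(t for t, _ in runs), runs=run_objs)

paragraphs = [
    fake_paragraph(("A. Cat", True)),
    fake_paragraph(("D. ", True), ("Fox", True)),  # one heading, two bold runs
    fake_paragraph(("B. not bold", None)),
    fake_paragraph(("Plain body text", True)),
]
headings = [h for h in map(heading_text, paragraphs) if h is not None]
print(headings)  # ['Cat', 'Fox']
```

Note that bold inherited from a style shows up as None on the run, so headings made bold through a style rather than direct formatting would be skipped by this check.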
Upvotes: 2