Reputation: 2613
I have a Word document that is structured as follows:
1. Heading
1.1. Sub-heading
(a) Sub-sub-heading
When I load the document in docx using the code:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText("a.docx"))
I get the following output.
Heading
Sub-heading
Sub-sub-heading
How can I extract the heading/sub-heading numbers also along with the text? I tried simplify_docx but that only works for standard MS Word heading styles and not on custom heading styles.
Upvotes: 1
Views: 2331
Reputation: 944
Unfortunately, the numbers are not part of the text; Word generates them itself based on the heading style (Heading i), and I don't think docx exposes any way to get this number.
However, you can retrieve the style (and hence the level) using para.style and then read through the document to recompute the numbering scheme. This is cumbersome, though, and it doesn't take into account any custom style you could be using. There might be a way to access the numbering scheme in the styles.xml part of the doc, but I don't know how.
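As a minimal sketch of that last idea, the numbering references can be read straight from the raw XML with the standard library, without going through docx. It assumes the paragraph carries a direct w:numPr element in word/document.xml. When the numbering is inherited from the paragraph style (word/styles.xml), this finds nothing, and the actual number format (w:numFmt, w:lvlText) lives in word/numbering.xml, which the sketch does not resolve. The helper name read_numbering_refs is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_numbering_refs(docx_file):
    """Return (text, num_id, level) for every w:p in word/document.xml.

    num_id and level are None when the paragraph has no direct w:numPr
    (or when the corresponding child element is missing).
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for p in root.iter(f"{{{W}}}p"):
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_id = level = None
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is not None:
            num_id_el = num_pr.find("w:numId", NS)
            ilvl_el = num_pr.find("w:ilvl", NS)
            if num_id_el is not None:
                num_id = int(num_id_el.get(f"{{{W}}}val"))
            if ilvl_el is not None:
                level = int(ilvl_el.get(f"{{{W}}}val"))  # zero-based level
        results.append((text, num_id, level))
    return results
```

The returned num_id and level only tell you which list definition and which level a paragraph belongs to; turning them into "1.1." or "(a)" still means counting paragraphs per level, as in the style-based approach.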
import docx

# Map the built-in style names 'Heading 1'..'Heading 9' to their level.
level_from_style_name = {f'Heading {i}': i for i in range(1, 10)}

def format_levels(cur_lev):
    # Drop unused (zero) levels, so [0, 1, 2, 0, ...] becomes '1.2'
    levs = [str(l) for l in cur_lev if l != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')

current_levels = [0] * 10  # index 0 is unused
full_text = []

for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        full_text.append(p.text)
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        # A new heading resets the counters of all deeper levels
        for l in range(level + 1, 10):
            current_levels[l] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for line in full_text:
    print(line)
which, for my test document, gives me:
Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You get the drill…
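For the custom styles and the "1. / 1.1. / (a)" scheme from the question, the same counting idea could be adapted roughly as follows. This is only a sketch: the style names in LEVEL_FROM_STYLE are made up and must be replaced with the names your document actually uses, and the function works on plain (style_name, text) pairs so that it doesn't depend on docx itself.

```python
from string import ascii_lowercase

# Hypothetical mapping from custom style names to outline levels; replace
# the keys with the style names actually used in your document.
LEVEL_FROM_STYLE = {
    "My Heading": 1,
    "My Sub-heading": 2,
    "My Sub-sub-heading": 3,
}


def format_number(counters, level):
    """Format counters in the question's scheme: '1.', '1.1.', '(a)'."""
    if level == 3:
        # Letters a..z; this sketch simply wraps around after 26 items.
        return f"({ascii_lowercase[(counters[3] - 1) % 26]})"
    return ".".join(str(counters[i]) for i in range(1, level + 1)) + "."


def number_paragraphs(paragraphs, level_from_style=LEVEL_FROM_STYLE):
    """Prefix headings with recomputed numbers.

    `paragraphs` is an iterable of (style_name, text) pairs, e.g.
    ((p.style.name, p.text) for p in docx.Document(path).paragraphs).
    """
    max_level = max(level_from_style.values())
    counters = [0] * (max_level + 1)  # index 0 is unused
    lines = []
    for style_name, text in paragraphs:
        level = level_from_style.get(style_name)
        if level is None:
            lines.append(text)  # not a heading: keep the text as-is
            continue
        counters[level] += 1
        # A new heading resets the counters of all deeper levels.
        for deeper in range(level + 1, max_level + 1):
            counters[deeper] = 0
        lines.append(f"{format_number(counters, level)} {text}")
    return lines
```

Keeping the style-to-level mapping as an explicit dictionary means nothing is guessed from the style name, at the cost of having to list every heading style by hand.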
Upvotes: 5