Haroon S.

Reputation: 2613

python-docx: Extracting text along with heading and sub-heading numbers

I have a Word document that is structured as follows:

1. Heading
    1.1. Sub-heading
        (a) Sub-sub-heading

When I load the document with python-docx using the code:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
print(getText("a.docx"))

I get the following output.

Heading
Sub-heading
Sub-sub-heading

How can I extract the heading/sub-heading numbers also along with the text? I tried simplify_docx but that only works for standard MS Word heading styles and not on custom heading styles.

Upvotes: 1

Views: 2331

Answers (1)

Big Bro

Reputation: 944

Unfortunately, the numbers are not part of the text; Word generates them itself from the heading style (Heading i), and I don't think python-docx exposes any way to get this number.

However, you can retrieve the style (and so the level) using para.style and then walk through the document to recompute the numbering scheme. This is cumbersome, though, and as written it does not take into account any custom style you could be using. There might be a way to read the numbering scheme from the styles.xml and numbering.xml parts of the document, but I don't know how.
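For anyone who wants to look into the document's XML parts, here is a rough sketch of where the list information sits. It uses only the standard library and assumes the usual WordprocessingML layout: a paragraph (or the style it references) may carry a w:numPr element holding a list id (w:numId) and a zero-based level (w:ilvl). The names list_levels and _num_pr are made up for this example. It does not produce the rendered number; that would also need the formats and start values in numbering.xml, and it ignores inheritance between styles.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(node):
    """Read the integer w:val attribute of a node, or None if the node is missing."""
    return int(node.get(f"{{{W}}}val")) if node is not None else None


def _num_pr(element):
    """Return (numId, ilvl) from element's w:pPr/w:numPr, or None if absent."""
    num_pr = element.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    num_id = _val(num_pr.find("w:numId", NS))
    ilvl = _val(num_pr.find("w:ilvl", NS)) or 0  # a missing level means level 0
    return num_id, ilvl


def list_levels(docx_file):
    """Return (text, num_id, ilvl) per paragraph, in document order.

    num_id and ilvl are None for paragraphs without list information.
    """
    with zipfile.ZipFile(docx_file) as zf:
        document = ET.fromstring(zf.read("word/document.xml"))
        style_num = {}  # style id -> (numId, ilvl) declared on the style
        if "word/styles.xml" in zf.namelist():
            styles = ET.fromstring(zf.read("word/styles.xml"))
            for style in styles.findall("w:style", NS):
                found = _num_pr(style)
                if found is not None:
                    style_num[style.get(f"{{{W}}}styleId")] = found

    rows = []
    for p in document.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        found = _num_pr(p)  # numbering set directly on the paragraph wins
        if found is None:
            p_style = p.find("w:pPr/w:pStyle", NS)
            if p_style is not None:
                found = style_num.get(p_style.get(f"{{{W}}}val"))
        rows.append((text, *(found or (None, None))))
    return rows
```

Calling list_levels("a.docx") would then show which paragraphs belong to a list and at which level, whatever the style is called. Recomputing the numbers from the style levels, as described above, is the simpler fallback: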

import docx

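# Map the built-in heading style names ('Heading 1' .. 'Heading 9') to their level
level_from_style_name = {f'Heading {i}': i for i in range(1, 10)}

def format_levels(cur_lev):
    # Skip counters that are still zero (index 0 is unused, deeper levels not reached yet)
    levs = [str(lev) for lev in cur_lev if lev != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')

current_levels = [0] * 10  # one counter per level; index 0 stays unused
full_text = []

for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        full_text.append(p.text)  # ordinary paragraph: keep the text as-is
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        # A new heading resets the counters of all deeper levels
        for deeper in range(level + 1, 10):
            current_levels[deeper] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for line in full_text:
    print(line)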

which, for a sample document (image not shown), gives me

Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You got the drill…
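Since the question is about custom heading styles, the same counting idea can be driven by your own style-to-level mapping instead of the built-in 'Heading i' names. Below is a minimal sketch under a few assumptions: the style names MyH1, MyH2 and MyH3 are invented placeholders, number_headings and format_label are hypothetical helpers, and the label format simply imitates the 1. / 1.1. / (a) scheme from the question. It works on plain (style name, text) pairs so it can be tried without a document.

```python
import string


def format_label(counters):
    """Build a label in the question's scheme: '1.', '1.1.', then '(a)' at level 3.

    Levels deeper than 3 fall back to dotted numbers, and letters wrap around
    after 'z'; both are simplifications of what Word can do.
    """
    if len(counters) == 3:
        letter = string.ascii_lowercase[(counters[2] - 1) % 26]
        return f"({letter})"
    return ".".join(str(c) for c in counters) + "."


def number_headings(paragraphs, level_by_style):
    """Prefix heading paragraphs with recomputed numbers.

    paragraphs: iterable of (style_name, text) pairs.
    level_by_style: dict mapping a style name to a 1-based heading level.
    """
    counters = []  # counters[i] is the running count for level i + 1
    out = []
    for style_name, text in paragraphs:
        level = level_by_style.get(style_name)
        if level is None:
            out.append(text)  # not a heading: keep the text unchanged
            continue
        # Drop counters of deeper levels, pad with zeros if a level was skipped
        counters = counters[:level] + [0] * (level - len(counters))
        counters[level - 1] += 1
        out.append(f"{format_label(counters)} {text}")
    return out


# Placeholder style names; replace them with the names used in your document
levels = {"MyH1": 1, "MyH2": 2, "MyH3": 3}
sample = [
    ("Normal", "Intro"),
    ("MyH1", "Heading"),
    ("MyH2", "Sub-heading"),
    ("MyH3", "Sub-sub-heading"),
    ("MyH3", "Another"),
    ("MyH1", "Next"),
]
numbered = number_headings(sample, levels)
print("\n".join(numbered))
```

With python-docx, the pairs can be produced with ((p.style.name, p.text) for p in doc.paragraphs). Keep in mind that this recomputes the numbers from the order of the headings; it does not read what Word actually displays, so restarted or manually adjusted lists will come out differently.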

Upvotes: 5
