Bruno Guarita

Reputation: 817

Read Word Document and get text for each heading

I have a Microsoft Word document and I need to extract its text into a data frame, one row per section. Each section starts with a heading that is formatted in Word as "Heading 2". For example:

This is section one

This is the text for the first section.

This is the second section of the document

And this is the text for the second section.

I need to get the text for each section into a data frame, with the section name in column A and the section text in column B.

I am new to Python and I am trying the docx package, but the only thing I was able to do was get the full text, using a function I found on Stack Overflow.

Function (readDocx):

#! python3
from docx import Document

def getText(filename):
    doc = Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Code to get the text:

import readDocx

test = readDocx.getText('THE FILE.docx')

I was able to find this loop, which identifies the headings. The problem is how to iterate through the document and get each heading and its text into a data frame:

from docx import Document

docs = Document("THE FILE.docx")

# Print the text of every paragraph styled as "Heading 2".
for paragraph in docs.paragraphs:
    if paragraph.style.name == 'Heading 2':
        print(paragraph.text)

Upvotes: 4

Views: 10294

Answers (2)

abdulsaboor

Reputation: 718

Use this:

from docx import Document

document = Document("demo.docx")
headings = []
texts = []
para = []
for paragraph in document.paragraphs:
    if paragraph.style.name.startswith("Heading"):
        if headings:
            texts.append(para)
        headings.append(paragraph.text)
        para = []
    elif paragraph.style.name == "Normal":
        para.append(paragraph.text)
# Append the text of the last section, which no later heading closes.
if headings:
    texts.append(para)

for h, t in zip(headings, texts):
    print(h, t)
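
The same grouping can be written as a small function that returns one row per section, which is closer to the shape the question asks for. This is only a sketch under some assumptions: `group_by_heading` and `rows_from_docx` are hypothetical helper names, the heading style is taken to be "Heading 2" as in the question, and every non-empty paragraph after a heading is kept (not only those styled "Normal"). The sample pairs stand in for a real document so that the grouping logic can be checked without a .docx file.

```python
def group_by_heading(paragraphs, heading_style="Heading 2"):
    """Group (style_name, text) pairs into (heading, section_text) rows."""
    rows = []
    current_heading = None
    body = []
    for style_name, text in paragraphs:
        if style_name == heading_style:
            # A new heading closes the previous section, if there was one.
            if current_heading is not None:
                rows.append((current_heading, "\n".join(body)))
            current_heading = text
            body = []
        elif current_heading is not None and text.strip():
            # Keep non-empty paragraphs that follow a heading.
            body.append(text)
    # Close the final section.
    if current_heading is not None:
        rows.append((current_heading, "\n".join(body)))
    return rows


def rows_from_docx(path, heading_style="Heading 2"):
    """Read a .docx file with python-docx and group it by heading."""
    # Imported here so group_by_heading can be used without python-docx.
    from docx import Document

    doc = Document(path)
    pairs = ((p.style.name, p.text) for p in doc.paragraphs)
    return group_by_heading(pairs, heading_style)


# Stand-in for a document with two sections, as in the question.
sample = [
    ("Heading 2", "This is section one"),
    ("Normal", "This is the text for the first section."),
    ("Heading 2", "This is the second section of the document"),
    ("Normal", "And this is the text for the second section."),
]
rows = group_by_heading(sample)
print(rows)
```

For a real file, `rows_from_docx("THE FILE.docx")` should give the same kind of list, which can then be passed to pandas.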

Upvotes: 1

user9455968

Reputation:

For a docx that looks like this

[screenshot of the example document]

this could be a starting point:

from docx import Document

document = Document("demo.docx")
headings = []
texts = []
for paragraph in document.paragraphs:
    if paragraph.style.name == "Heading 2":
        headings.append(paragraph.text)
    elif paragraph.style.name == "Normal":
        texts.append(paragraph.text)

for h, t in zip(headings, texts):
    print(h, t)

Output:

Heading, level 2 A plain paragraph having some bold and some italic.
Heading, level 2 Foo
Heading, level 2 Bar

I don't know pandas, but it should be easy to get from a list of tuples (produced by zip) to a data frame.
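
That last step could look roughly like the following, assuming pandas is installed. The column names `section` and `text` are illustrative, and the two lists are stand-in data taken from the question's example rather than real output. Note that `zip` stops at the shorter list, so this only lines up correctly when each heading has exactly one text entry.

```python
import pandas as pd

# Stand-in data; in practice these lists come from the loop over paragraphs.
headings = [
    "This is section one",
    "This is the second section of the document",
]
texts = [
    "This is the text for the first section.",
    "And this is the text for the second section.",
]

# Each (heading, text) tuple becomes one row; the column names are illustrative.
df = pd.DataFrame(list(zip(headings, texts)), columns=["section", "text"])
print(df)
```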

Upvotes: 1
