Reputation: 817
I have a Microsoft Word document and I need to extract the text and structure it into a data frame, with one row per section of the document. Each section starts with a heading, which is formatted in Word as "Heading 2". For example:
This is section one
This is the text for the first section.
This is the second section of the document
And this is the text for the second section.
I need to get the text for each section in a data frame where in column A I would have the section name and in column B I would have the section text.
I am new to Python and I am trying the docx package, but the only thing I was able to do was get the full text using a function I found on Stack Overflow.
Function (readDocx):
#! python3
from docx import Document

def getText(filename):
    doc = Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Code to get the text:
import readDocx
test = readDocx.getText('THE FILE.docx')
I was able to find this loop that identifies the headings. The problem is how to iterate through the document and get each heading and its text into a data frame:
from docx import Document
from docx.shared import Inches

docs = Document("THE FILE.docx")
for paragraph in docs.paragraphs:
    if paragraph.style.name == 'Heading 2':
        print(paragraph.text)
Upvotes: 4
Views: 10294
Reputation: 718
Use this:
from docx import Document

document = Document("demo.docx")
headings = []
texts = []
para = []  # paragraphs collected for the current section
for paragraph in document.paragraphs:
    if paragraph.style.name.startswith("Heading"):
        # A new heading closes the previous section, if there is one
        if headings:
            texts.append(para)
        headings.append(paragraph.text)
        para = []
    elif paragraph.style.name == "Normal":
        para.append(paragraph.text)
# Close the last section
if para or len(headings) > len(texts):
    texts.append(para)
for h, t in zip(headings, texts):
    print(h, t)
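The question asks for one row per section, so the per-section paragraph lists above still need to be joined into a single text. A minimal sketch of that step, under a few assumptions: sections start only at "Heading 2" (as in the question), every other paragraph after a heading counts as section text, and paragraphs are joined with newlines. `group_sections` and `read_sections` are hypothetical helper names; the first one relies only on the `.text` and `.style.name` attributes already used above.

```python
def group_sections(paragraphs, heading_style="Heading 2"):
    """Return a list of (heading, body_text) tuples, one per section.

    `paragraphs` is any iterable of objects exposing `.text` and
    `.style.name`, such as `Document(...).paragraphs` from python-docx.
    """
    sections = []
    heading = None
    body = []
    for paragraph in paragraphs:
        if paragraph.style.name == heading_style:
            # A new heading closes the previous section, if there was one.
            if heading is not None:
                sections.append((heading, "\n".join(body)))
            heading = paragraph.text
            body = []
        elif heading is not None:
            # Text that appears before the first heading is ignored here.
            body.append(paragraph.text)
    # Close the last section.
    if heading is not None:
        sections.append((heading, "\n".join(body)))
    return sections


def read_sections(filename, heading_style="Heading 2"):
    # python-docx is imported here so group_sections stays dependency-free.
    from docx import Document

    return group_sections(Document(filename).paragraphs, heading_style)
```

The list returned by `read_sections("THE FILE.docx")` has one tuple per section, which is the shape a two-column data frame needs.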
Upvotes: 1
Reputation:
For a docx that looks like this, this could be a starting point:
from docx import Document

document = Document("demo.docx")
headings = []
texts = []
for paragraph in document.paragraphs:
    if paragraph.style.name == "Heading 2":
        headings.append(paragraph.text)
    elif paragraph.style.name == "Normal":
        texts.append(paragraph.text)
for h, t in zip(headings, texts):
    print(h, t)
Output:
Heading, level 2 A plain paragraph having some bold and some italic.
Heading, level 2 Foo
Heading, level 2 Bar
I don't know pandas, but it should be easy to get from a list of tuples (produced by zip) to a data frame.
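For the pandas step, here is a minimal sketch. The `pandas.DataFrame` constructor accepts a list of dicts together with a `columns` argument. The column names `section` and `text` and the helper names `to_records` and `to_dataframe` are assumptions chosen for illustration, and the sample values are taken from the output above.

```python
def to_records(headings, texts):
    """Pair each heading with its text as a dict, one dict per row."""
    return [{"section": h, "text": t} for h, t in zip(headings, texts)]


def to_dataframe(headings, texts):
    # pandas is imported here so to_records has no third-party dependency.
    import pandas as pd

    return pd.DataFrame(to_records(headings, texts), columns=["section", "text"])


headings = ["Heading, level 2", "Heading, level 2"]
texts = ["Foo", "Bar"]
records = to_records(headings, texts)
print(records)
```

Note that `zip` stops at the shorter list, so a heading without a matching text entry is silently dropped; this only gives the right pairing when each heading is followed by exactly one "Normal" paragraph.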
Upvotes: 1