AlfiyaFaisy

Reputation: 434

Extraction of text page by page from an MS Word docx file using Python

I have an MS Word docx file and need to extract its text page by page. I tried python-docx, but it extracts the whole text rather than page-wise text. I also converted the docx to PDF and then extracted the text, but the conversion changed the page structure: for example, the font size changed, and content that fit on one page of the docx took more than one page in the PDF.

I am looking for a stable solution that extracts text page by page from a docx file (doing it without converting to PDF would fit my overall solution better). Can somebody help me with this?

Upvotes: 6

Views: 14463

Answers (5)

user18677603

Reputation: 31

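import os
import win32com.client  # Windows only; requires Microsoft Word to be installed
import pdfplumber

WD_FORMAT_PDF = 17  # Word's wdFormatPDF constant

# Filepath is the path to your .docx file; Word needs absolute paths
in_file = os.path.abspath(Filepath)
out_file = os.path.abspath("out.pdf")

# Let Word itself render the document and save it as PDF
word = win32com.client.Dispatch('Word.Application')
try:
    doc = word.Documents.Open(in_file)
    doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
    doc.Close()
finally:
    word.Quit()

# Read the PDF back one page at a time
with pdfplumber.open(out_file) as pdf:
    for page in pdf.pages:
        out = page.extract_text()
        print(out)

As far as I know, saving to PDF through Word itself (via win32com) produces a 1:1 copy of the document, so the page layout should be preserved.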

Upvotes: -2

Ishank Saxena

Reputation: 319

I faced a similar scenario recently. The following using docx2python worked for me:

from docx2python import docx2python

doc_result = docx2python('page-wise-file.docx')
# body is nested as [table][row][cell][paragraph]; take the first cell's paragraphs
paragraphs = doc_result.body[0][0][0]

count = 0
para = 0
pages = []
while para < len(paragraphs):
    if paragraphs[para] != "":
        # Treat each run of consecutive non-empty paragraphs as one page
        current_page = {}
        current_page_paras = []
        count += 1
        # Check the bound first to avoid an IndexError at the end of the list
        while para < len(paragraphs) and paragraphs[para] != "":
            current_page_paras.append(paragraphs[para])
            para += 1
        current_page["page_text"] = "\n".join(current_page_paras)
        current_page["page_no"] = count
        pages.append(current_page)
    else:
        para += 1

Although this will lead to losing any formatting information or any other metadata from the text, if extracting text is the only aim then this should work.
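
The grouping step in the loop above does not depend on docx2python itself, so it can be tried on a plain list of paragraph strings. A minimal sketch, assuming (as the loop does) that empty strings separate the pages; the function name `group_into_pages` and the sample list are made up for illustration:

```python
from itertools import groupby


def group_into_pages(paragraphs):
    """Group consecutive non-empty paragraphs; empty strings act as separators."""
    pages = []
    for is_text, run in groupby(paragraphs, key=lambda p: p != ""):
        if is_text:
            pages.append({"page_text": "\n".join(run), "page_no": len(pages) + 1})
    return pages


# Hypothetical paragraph list, standing in for doc_result.body[0][0][0]
sample = ["Title", "Intro line", "", "", "Second block", ""]
for page in group_into_pages(sample):
    print(page["page_no"], repr(page["page_text"]))
```

Note that this only recovers real pages if the document happens to separate them with empty paragraphs; it does not detect page breaks as such.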

As Gerd mentioned, converting the file to PDF and then processing it can also help since libraries like PyPDF2 allow you to read individual pages, for example:

from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0

reader = PdfReader("page-wise-file.pdf")
page = reader.pages[0]  # first page
print(page.extract_text())

Upvotes: 3

AlfiyaFaisy

Reputation: 434

I found that the Tika library can return the parsed file as XML (xmlContent=True). I used that XML output and split it at the page markers to get the text of each page. Below is the Python code that worked for me.

from tika import parser

def extract_pages(file):
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    # Strip paragraph and plain div tags, keeping the page divs as separators
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):  # check if it worked correctly
        return text_pages
    return None  # page count did not match the metadata
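
The chain of `replace` calls only removes the exact tags listed, so tags with attributes or HTML entities would survive in the text. A minimal sketch of a more general split, assuming the XML really does wrap each page in `<div class="page">` as described above; the function name is hypothetical and the sample string is made up to show the shape, not real Tika output:

```python
import html
import re

PAGE_DIV = '<div class="page">'


def split_xhtml_pages(xhtml):
    """Return the plain text of each <div class="page"> block in an XHTML string."""
    match = re.search(r"<body[^>]*>(.*)</body>", xhtml, flags=re.DOTALL)
    body = match.group(1) if match else xhtml
    chunks = body.split(PAGE_DIV)[1:]  # anything before the first page div is dropped
    pages = []
    for chunk in chunks:
        text = re.sub(r"<[^>]+>", "", chunk)  # strip any remaining tags
        pages.append(html.unescape(text).strip())
    return pages


# Made-up sample shaped like the output described above
sample = (
    "<html><body>"
    '<div class="page"><p>Page one &amp; intro</p></div>'
    '<div class="page"><p>Page two</p><p /></div>'
    "</body></html>"
)
print(split_xhtml_pages(sample))
```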

Upvotes: 0

Gerd

Reputation: 2803

It seems to me that the docx format (and therefore also the python-docx library) only supports paragraphs and sections.

Microsoft Word does not support the concept of hard pages. Instead, when the exported document is opened in Word, Word repaginates it again based on the page size. (source)

So in fact the pagination is not stored in the docx file, but rather carried out by the rendering engine:

DOCX files contain no information about pagination. You won’t find the number of pages in the document unless you calculate how much space you need for each line to ascertain the number of pages. (source)

This page has some more background and recommends using PDF if pagination must be preserved.
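
The only page hints that can appear inside the document XML are explicit page breaks (`<w:br w:type="page"/>`) and the `<w:lastRenderedPageBreak/>` markers that Word may cache from its last layout pass. A minimal sketch that splits the text at those markers, using only the standard library. It assumes the file actually contains such markers (files written by other tools often have none, and cached markers can be stale), and the function name is made up for illustration:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_on_page_break_hints(docx_path):
    """Split document text at explicit page breaks and cached render markers."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    pages, current = [], []
    for node in root.find(W + "body").iter():
        if node.tag == W + "p":
            current.append("\n")  # paragraph boundary
        elif node.tag == W + "lastRenderedPageBreak" or (
            node.tag == W + "br" and node.get(W + "type") == "page"
        ):
            pages.append("".join(current))  # close the current page
            current = []
        elif node.tag == W + "t" and node.text:
            current.append(node.text)
    pages.append("".join(current))
    return [page.strip() for page in pages]


# Usage (the path is a placeholder):
# for number, text in enumerate(split_on_page_break_hints("page-wise-file.docx"), 1):
#     print(number, text)
```

If the result has a single entry, the file carries no page hints at all and only a rendering engine can tell where the pages fall.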

Upvotes: 6

Debi

Reputation: 16

Try this:


from docx import Document

document = Document('anydocument.docx')
for para in document.paragraphs:
    print(para.text)

Upvotes: -4
