Reputation: 434
I have an MS Word docx file and I need to extract text from it page by page. I tried python-docx, but it could only extract the whole text, not the text of each page. I also converted my docx to PDF and then tried text extraction. The problem is that the conversion changed the page structure of the docx. For example, the font size changed during conversion, and the text of a single docx page ran over more than one page in the PDF.
I am looking for a stable solution that extracts text page by page from a docx (doing this without converting to PDF would be better for my overall solution). Can somebody help me with this?
Upvotes: 6
Views: 14463
Reputation: 31
import os
import win32com.client
import pdfplumber
wdFormatPDF = 17  # Word's file format constant for PDF
in_file = os.path.abspath("input.docx")  # placeholder name; Word needs an absolute path
out_file = os.path.abspath("out.pdf")
# Let Word itself do the conversion so its own pagination is used
word = win32com.client.Dispatch('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
# Read the PDF back one page at a time
with pdfplumber.open(out_file) as pdf:
    for page in pdf.pages:
        out = page.extract_text()
        print(out)
As far as I know, saving to PDF through Word itself with win32com gives a 1:1 copy of the document's layout, so the pages should match the original.
Upvotes: -2
Reputation: 319
I faced a similar scenario recently. The following, using docx2python, worked for me:
from docx2python import docx2python
doc_result = docx2python('page-wise-file.docx')
# body is nested as table > row > cell > paragraph; take the paragraphs of the first cell
paragraphs = doc_result.body[0][0][0]
count = 0
para = 0
pages = []
while para < len(paragraphs):
    if paragraphs[para] != "":
        # A non-empty paragraph starts a new page
        current_page = {}
        current_page_paras = []
        count += 1
        # Check the bound first to avoid an IndexError at the end of the list
        while para < len(paragraphs) and paragraphs[para] != "":
            current_page_paras.append(paragraphs[para])
            para += 1
        current_page["page_text"] = "\n".join(current_page_paras)
        current_page["page_no"] = count
        pages.append(current_page)
    else:
        # Empty strings are treated as page separators
        para += 1
Although this will lead to losing any formatting information or any other metadata from the text, if extracting text is the only aim then this should work.
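The grouping step above does not depend on docx2python itself. As a minimal sketch, assuming the paragraphs are already available as a flat list of strings in which empty strings separate pages (the same assumption the loop above makes), the same result can be built with itertools.groupby. The name group_pages is hypothetical and only used for illustration.

```python
from itertools import groupby


def group_pages(paragraphs):
    """Group consecutive non-empty paragraphs into pages.

    Assumption: empty strings act as page separators, as in the loop above.
    """
    pages = []
    # groupby yields runs of consecutive items that share the same key
    for is_text, chunk in groupby(paragraphs, key=lambda p: p != ""):
        if is_text:
            pages.append({
                "page_no": len(pages) + 1,
                "page_text": "\n".join(chunk),
            })
    return pages
```

It would be called as group_pages(doc_result.body[0][0][0]). Whether empty paragraphs really line up with page boundaries depends on how the document was written, so the output is worth checking against the file in Word.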
As Gerd mentioned, converting the file to PDF and then processing it can also help since libraries like PyPDF2 allow you to read individual pages, for example:
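# PdfReader, .pages and extract_text() replace the deprecated PdfFileReader, getPage() and extractText()
from PyPDF2 import PdfReader
pdf = PdfReader("page-wise-file.pdf")
page = pdf.pages[0]  # first page
print(page.extract_text())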
Upvotes: 3
Reputation: 434
I found that the Tika library offers an xmlContent option when reading the file. I used it to get the XML output and then split that output into pages with plain string operations. The Python code that worked for me is below.
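from tika import parser
def extract_pages(file):  # wrapper name is illustrative; the original snippet was a function body
    # xmlContent=True makes Tika return XHTML instead of plain text
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    # Each page starts with a <div class="page"> marker
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    # Check that the split matches the page count reported in the metadata
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):
        return text_pages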
Upvotes: 0
Reputation: 2803
It seems to me that the docx format (and therefore also the python-docx library) only supports paragraphs and sections, not pages.
Microsoft Word does not support the concept of hard pages. Instead, when the exported document is opened in Word, Word repaginates it again based on the page size. (source)
So in fact the pagination is not stored in the docx file, but rather carried out by the rendering engine:
DOCX files contain no information about pagination. You won’t find the number of pages in the document unless you calculate how much space you need for each line to ascertain the number of pages. (source)
This page has some more background and recommends using PDF if pagination must be kept.
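One caveat to the point above: although the layout is computed by the rendering engine, Word usually leaves hints in word/document.xml when it saves, namely explicit breaks (w:br with w:type="page") and w:lastRenderedPageBreak markers from its last layout pass. The following is a minimal sketch using only the standard library, assuming the file was last saved by Word (other producers may omit the markers) and that an approximate split is acceptable. The helper name split_docx_pages is hypothetical, and empty pages are dropped because a hard break is often followed immediately by a rendered-break marker.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_pages(path):
    """Split the body text on page-break markers stored in the XML.

    Assumption: the markers reflect Word's last layout pass; they are
    hints, not guaranteed pagination.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    pages, current = [], []
    # Walk every element of the body once, in document order
    for node in body.iter():
        if node.tag == W + "p":
            current.append("\n")  # a new paragraph starts on a new line
        elif node.tag == W + "t":
            current.append(node.text or "")
        elif node.tag == W + "lastRenderedPageBreak" or (
            node.tag == W + "br" and node.get(W + "type") == "page"
        ):
            pages.append("".join(current))
            current = []
    pages.append("".join(current))
    # Drop empty pages, e.g. when a hard break and a rendered break are adjacent
    return [p.strip() for p in pages if p.strip()]
```

Calling split_docx_pages on a file returns a list of strings, one per detected page. Headers, footers and footnotes live in other parts of the archive and are not read here.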
Upvotes: 6
Reputation: 16
Try this:
from docx import Document
document = Document('anydocument.docx')
# Prints every paragraph in the document body in order (no page information)
for para in document.paragraphs:
    print(para.text)
Upvotes: -4