Reputation: 1129
I am looking for an elegant solution to find on what page(s) in a document a certain word occurs that I have stored in a python dictionary/list.
I first considered .docx format as an input and had a look at PythonDocx which has a search function, but there's obviously not really a pages attribute in the docx/xml format.
If I parse the document I could look for <w:br w:type="page"/>
occurrences in the xml tree but unfortunately these do not show non-forced page breaks.
I even considered converting files to PDF first and use something like PDFminer to parse the document page-wise.
Is there any straightforward solution to search a .docx document for a string and return the pages it occurs on like
[('foo' ,[1, 4, 7 ]), ('bar', [2]), ('baz', [2, 5, 8, 9 )]
Upvotes: 7
Views: 2691
Reputation: 2734
It seems that the biggest challenge in your question is how to be able to parse a document page by page. This answer of a word document is not always the same and it depends on the margins, the paper sheet settings, the application you use to open it etc. A good reasoning on the accuracy of any script for this purpose can be found at google group.
However, if you can be satisfied with a almost 100% accurate, you start to find a solution as suggested in this google group:
I found that I can unzip the .docx file and extract
docProps/app.xml
, then parse the XML with ElementTree to get the<Pages></Pages>
element. I found that most of the time that number is accurate, but I've seen a few instances where the number in that element is not correct.
Another approach could be to use win32com.client
to open the file, paginate it, make your search and then return the results in the format you want it.
You can find an example of the syntax in this answer:
from win32com.client import Dispatch
#open Word
word = Dispatch('Word.Application')
word.Visible = False
word = word.Documents.Open(doc_path)
#get number of sheets
word.Repaginate()
num_of_sheets = word.ComputeStatistics(2)
You can also have a look to this answer regarding find and replace in a word document using win32com.client.
Upvotes: 3