Reputation: 1
I'm analysing a PDF, and for some reason many of the words end up with random spaces inside them, or with no space between them, once I extract the text in Python. I'm using PdfReader from PyPDF2.
Examples:
Y ou’re sweet, but I feel fine.
I wish I feltas calmas you look.
The strange thing is that the PDF itself looks correct: the extra and missing spaces only show up in the text after I extract it in Python.
So my proposed solution is a grammar or spellchecking module that will look at some text like 'y ou' and make it 'you' (and 'asif' to 'as if'). It would be great if there were a way to only enable that spellchecking feature, because I don't want it to change other things in the pdf.
I welcome any other solutions (perhaps in the way I'm collecting the pdf).
My current code looks like this:
from PyPDF2 import PdfReader

def all_pages1(num, start, stop):
    global file
    with open(f'example{num}.txt', 'w') as file:
        path = "C:/example.pdf"
        with open(path, mode='rb') as file2:
            reader = PdfReader(file2)
            for page in range(start, stop):
                page1 = reader.pages[page]
                # extractText() is the deprecated name; PdfReader pages expose extract_text()
                text = page1.extract_text()
                main(num, text)
    # no explicit close() calls needed: both with blocks close their files
main() does the actual searching, which isn't relevant to my problem.
Upvotes: 0
Views: 149
Reputation: 9057
Disclaimer: I am the author of borb, the library used in this answer.
PDF is not a WYSIWYG (what you see is what you get) format.
If you open a webpage, you can expect to see <p> elements containing text exactly as it is rendered on the page (and conversely, exactly as you would expect to extract it).
In a PDF however, you will find rendering instructions. In pseudo-code, you would find something like:
Importantly, a space can be realized simply by moving the position at which the next character is drawn, rather than by actually rendering the character <space>.
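As a rough illustration, here is what such instructions could look like as data. The operation names (`set_font`, `move_to`, `show`) and the coordinates are made up for this sketch; real content streams use operators such as Tf, Td and Tj, and this is not any library's API.

```python
# Hypothetical, simplified rendering instructions for the visible text "Hi you".
# Note that no instruction ever draws a space character.
instructions = [
    ("set_font", "Helvetica", 12),
    ("move_to", 50, 700),
    ("show", "Hi"),
    ("move_to", 70, 700),  # the gap between the words is only a jump in position
    ("show", "you"),
]

# Naively concatenating the shown strings loses the word boundary.
shown = [op[1] for op in instructions if op[0] == "show"]
naive_text = "".join(shown)
print(naive_text)  # Hiyou
```

This is why an extractor has to reconstruct spaces from positions instead of reading them from the file.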
Whenever a PDF library needs to extract text from a PDF, it will essentially loop over all rendering instructions and store them. It will then sort them in logical reading order (top to bottom, left to right).
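A minimal sketch of that sorting step, assuming each stored piece of text carries its position. `Chunk` and `reading_order` are hypothetical names for this example, and real libraries typically group baselines with some tolerance instead of comparing them exactly.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    text: str
    x: float  # left edge
    y: float  # baseline; in PDF coordinates y grows upwards


def reading_order(chunks):
    # Top to bottom means descending y; left to right means ascending x.
    return sorted(chunks, key=lambda c: (-c.y, c.x))


# Chunks in the arbitrary order in which the instructions were encountered.
chunks = [
    Chunk("fine.", 120, 680),
    Chunk("You're", 50, 700),
    Chunk("I feel", 50, 680),
    Chunk("sweet,", 95, 700),
]
ordered = [c.text for c in reading_order(chunks)]
print(ordered)  # ["You're", 'sweet,', 'I feel', 'fine.']
```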
Then it needs to determine whether to insert a space between the previously extracted text and the next character. To do so, it asks the active font "how big is a space character?" and compares the answer to the distance between the previous character and the new one.
e.g.
'AB': the horizontal distance is 5 and the space width of Helvetica 12 is 120, so the characters do not need a space between them.
'A B': the horizontal distance is 125, hence a space is inserted.
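The comparison can be sketched as follows, reusing the numbers from the example. `Glyph` and `join_glyphs` are hypothetical names, the glyph width of 600 is invented, and the threshold (a gap of at least one full space width) is an assumption; actual libraries may use a fraction of the space width or other heuristics.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def join_glyphs(glyphs, space_width):
    """Join glyphs left to right, adding a space where the gap is at least space_width."""
    out = []
    prev_end = None
    for g in glyphs:
        if prev_end is not None and g.x - prev_end >= space_width:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


tight = [Glyph("A", 0, 600), Glyph("B", 605, 600)]  # gap of 5
loose = [Glyph("A", 0, 600), Glyph("B", 725, 600)]  # gap of 125

print(join_glyphs(tight, 120))  # AB
print(join_glyphs(loose, 120))  # A B

# If the font reports a wrong space width, the same gaps are misjudged:
print(join_glyphs(tight, 4))    # A B  (spurious space, as in "Y ou")
print(join_glyphs(loose, 500))  # AB   (missing space, as in "feltas")
```

The last two calls show how an unreliable space width produces both symptoms from the question with the geometry left unchanged.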
Fonts are a mess in PDF, so I imagine the font in your PDF documents might simply be "broken", which forces text-extraction algorithms to "guess" the width of a space character.
There are various ways of doing this:
All of these might be reasons why text-extraction is failing.
You can try borb to see whether that fixes the problem.
#!chapter_005/src/snippet_005.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on the first Page
    print(l.get_text()[0])

if __name__ == "__main__":
    main()
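If switching extractors does not help, the dictionary-based repair proposed in the question could be sketched roughly like this. It is only an illustration: `VOCAB` is a tiny hypothetical word list (a real run would need a full dictionary), `repair` is a made-up helper, punctuation attached to words is not handled, and runs of whitespace are collapsed to single spaces.

```python
# Tiny hypothetical vocabulary, just large enough for the examples below.
VOCAB = {"you", "i", "wish", "felt", "as", "calm", "look", "if"}


def known(word):
    return word.lower() in VOCAB


def repair(text):
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Stray space: merge two tokens when the first is not a word by itself
        # but the merged form is.
        if nxt is not None and not known(tok) and known(tok + nxt):
            out.append(tok + nxt)
            i += 2
            continue
        # Missing space: split an unknown token into two known words.
        if not known(tok):
            for cut in range(1, len(tok)):
                if known(tok[:cut]) and known(tok[cut:]):
                    tok = tok[:cut] + " " + tok[cut:]
                    break
        out.append(tok)
        i += 1
    return " ".join(out)


print(repair("I wish I feltas calmas you look"))  # I wish I felt as calm as you look
print(repair("y ou look calm"))                   # you look calm
```

Because tokens that are already known words are left alone, this only touches the broken spots, though a larger dictionary will also produce more ambiguous merges and splits.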
Upvotes: 1