Reputation: 30605
I cropped a PDF file with PyPDF2, but when I try to extract text from the cropped PDF, I get the text of the whole page. How can I resolve that?
After cropping, the PDF file looks like this:
But when I run the command pdftotext out8.pdf out.txt, I get:
Contents Introduction Part I. Two Systems
The Characters of the Story
Attention and Effort
The Lazy Controller
The Associative Machine
The output was supposed to be only
The code I ran
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("./data/in2.pdf", "rb"))
output = PdfFileWriter()
page = input1.getPage(1)
x = page.mediaBox.getUpperRight_x()
y = page.mediaBox.getUpperRight_y()
page.cropBox.lowerLeft = (0, 331 - 150)  # lower-left corner of the crop region
page.cropBox.upperRight = (252,331)
output.addPage(page)
outputStream = open("out8.pdf", "wb")
output.write(outputStream)
outputStream.close()
Upvotes: 2
Views: 1563
Reputation: 3042
It sounds like pdftotext is extracting the text from the text layer. PDFs can have more than one layer: a purely scanned PDF has only an image layer, but many have an image layer plus a text layer. The text layer can be in front of the image, behind it, or not visible at all.
Unless the PDF has been prepared in a special way, the text layer does not align with the text seen in the image. In a multipage PDF the text may be split across the relevant pages, but it is otherwise not positioned on the page.
Cropping the image does not affect the text layer. When you extract the text, it comes from the text layer, which is still intact.
To get just the text of the cropped section, you'll need to run it through an OCR engine, e.g. Tesseract. Examples of Python packages that interface with Tesseract: pytesseract and tesserocr.
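As a rough illustration of the OCR route, here is a minimal sketch using pytesseract and Pillow. It assumes the page has already been rendered to an image of the full page (for example with pdftoppm) at a known DPI, that the media box starts at (0, 0), and that the page is 331 points tall; that height is only a guess based on the crop coordinates in the question. The function names are hypothetical.

```python
POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def crop_box_to_pixels(crop_box, page_height_pt, dpi=300):
    """Convert a PDF box (points, origin bottom-left) to a Pillow box
    (pixels, origin top-left) for a page image rendered at `dpi`."""
    x0, y0, x1, y1 = crop_box
    scale = dpi / POINTS_PER_INCH
    left = round(x0 * scale)
    right = round(x1 * scale)
    # PDF y grows upwards, image y grows downwards, so flip the axis.
    upper = round((page_height_pt - y1) * scale)
    lower = round((page_height_pt - y0) * scale)
    return (left, upper, right, lower)


def ocr_cropped_region(page_image_path, crop_box, page_height_pt, dpi=300):
    """OCR only the cropped region of a rendered page image.

    Requires Pillow, pytesseract and a Tesseract installation; the imports
    are local so the coordinate helper works without them.
    """
    from PIL import Image
    import pytesseract

    box = crop_box_to_pixels(crop_box, page_height_pt, dpi)
    with Image.open(page_image_path) as img:
        region = img.crop(box)
        return pytesseract.image_to_string(region)


if __name__ == "__main__":
    # Crop box from the question; page height of 331 pt is an assumption.
    print(crop_box_to_pixels((0, 181, 252, 331), page_height_pt=331, dpi=300))
```

If the page image was rendered from the already cropped PDF by a tool that honours the crop box, the image only contains the visible region and the crop step can be skipped.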
Some guides on how to set it up and run the processing:
ocr-on-pdf-files-using-python
ocr-python-easy
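If the text layer in your PDF does happen to line up with the visible page, a possible alternative that avoids OCR is to ask pdftotext to extract only a rectangular region using its -x, -y, -W and -H options. The sketch below builds such a command from the crop coordinates in the question. It assumes Poppler's pdftotext is on the PATH, the default resolution of 72 DPI (so one pixel equals one point), a media box starting at (0, 0), and a page height of 331 points, which is a guess. The helper names are hypothetical.

```python
import subprocess


def pdftotext_region_args(pdf_path, txt_path, crop_box, page_height_pt, page=1):
    """Build a pdftotext command that extracts text from one region only.

    crop_box is (x0, y0, x1, y1) in PDF points with the origin at the
    bottom-left; pdftotext expects the top-left corner plus width and height.
    """
    x0, y0, x1, y1 = crop_box
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),       # first and last page
        "-x", str(round(x0)),                    # left edge of the region
        "-y", str(round(page_height_pt - y1)),   # top edge, measured from the top
        "-W", str(round(x1 - x0)),               # region width
        "-H", str(round(y1 - y0)),               # region height
        pdf_path, txt_path,
    ]


def extract_region(pdf_path, txt_path, crop_box, page_height_pt, page=1):
    """Run pdftotext on the region; raises CalledProcessError on failure."""
    args = pdftotext_region_args(pdf_path, txt_path, crop_box, page_height_pt, page)
    subprocess.run(args, check=True)


if __name__ == "__main__":
    print(" ".join(pdftotext_region_args(
        "out8.pdf", "out.txt", (0, 181, 252, 331), page_height_pt=331)))
```

Whether this returns the expected text depends on how the text layer was produced, so treat it as something to try before falling back to OCR.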
Upvotes: 3