Content of whole page is still present even after the pdf file is cropped

Question

I cropped a PDF file with PyPDF2, but when I try to extract text from the cropped file, I get the text of the whole page. How can I resolve that?

After cropping, the PDF file looks like this:

But when I run the command pdftotext out8.pdf out.txt

I get:

Contents Introduction Part I. Two Systems

The Characters of the Story
Attention and Effort
The Lazy Controller
The Associative Machine
Cognitive Ease
Norms, Surprises, and Causes
A Machine for Jumping to Conclusions
How Judgments Happen
Answering an Easier Question Part II. Heuristics and Biases
The Law of Small Numbers <5>
Anchors
The Science of Availability
Availability, Emotion, and Risk
Tom W’s Specialty

The output was supposed to be only

The Characters of the Story

The code I ran

from PyPDF2 import PdfFileReader, PdfFileWriter

# Legacy PyPDF2 API (PdfFileReader / PdfFileWriter).
with open("./data/in2.pdf", "rb") as input_stream:
    input1 = PdfFileReader(input_stream)
    output = PdfFileWriter()

    # getPage is zero-based, so this selects the second page.
    page = input1.getPage(1)

    # Crop box corners in points, origin at the bottom-left of the page.
    page.cropBox.lowerLeft = (0, 331 - 150)
    page.cropBox.upperRight = (252, 331)
    output.addPage(page)

    # The input file must stay open until the output has been written.
    with open("out8.pdf", "wb") as output_stream:
        output.write(output_stream)

Alan · Accepted Answer

Sounds like it is extracting the text from the text layer. PDFs can have more than one layer: a purely image-based PDF has only the image layer, but many have an image layer together with a text layer. The text layer can be in front of the image, behind the image, or not visible at all.

Unless the PDF has been prepared in a special way, the text layer does not necessarily align with the text seen in the image. In a multipage PDF the text may be split across the relevant pages, but it is otherwise not arranged by position on the page.

When you crop the page, this does not affect the text layer. Setting the crop box in PyPDF2 only changes the region of the page that a viewer displays; nothing is removed from the page content. When you extract the text, it is taken from the text layer, which is still intact.
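Since the text outside the crop box is still in the file, the extraction tool has to be told which region to read. A minimal sketch, assuming poppler's pdftotext and its -x, -y, -W and -H region options, which as far as I know are measured in pixels from the top-left of the media box at the -r resolution. The helper name pdftotext_region_args is hypothetical, and page_height has to be the real media box height of your page (the 792 in the usage comment is only a placeholder):

```python
import subprocess


def pdftotext_region_args(crop_box, page_height, pdf_path, txt_path, page_number=1):
    """Build a pdftotext command that only extracts text inside crop_box.

    crop_box is (left, bottom, right, top) in PDF points, origin bottom-left.
    pdftotext expects the top-left corner of the region, measured downwards
    from the top of the page, so the y coordinate has to be flipped.
    """
    left, bottom, right, top = crop_box
    return [
        "pdftotext",
        "-r", "72",                          # 72 dpi, so 1 pixel == 1 point
        "-f", str(page_number),              # first page to convert
        "-l", str(page_number),              # last page to convert
        "-x", str(int(left)),                # region left edge
        "-y", str(int(page_height - top)),   # region top edge, from page top
        "-W", str(int(right - left)),        # region width
        "-H", str(int(top - bottom)),        # region height
        pdf_path,
        txt_path,
    ]


# Usage (assumes pdftotext is installed and the page is 792 points tall):
# args = pdftotext_region_args((0, 181, 252, 331), 792, "out8.pdf", "out.txt")
# subprocess.run(args, check=True)
```

This only helps when the text layer really is positioned where the visible text is. If it is not, the region will return the wrong text or nothing.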

To get just the text of the cropped section, one way is to render the cropped region as an image and process it through an OCR engine, e.g. Tesseract. Examples of Python packages which interface with Tesseract: pytesseract and tesserocr.

Some guides on how to set it up and run the processing:

ocr-on-pdf-files-using-python
ocr-python-easy
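
The OCR route described above could look roughly like this. It is a sketch, assuming pdf2image (which needs poppler) and pytesseract (which needs the Tesseract binary) are installed. The function names ocr_cropped_page and clean_ocr_text are made up for illustration, and the dpi value is a guess that may need tuning:

```python
def clean_ocr_text(raw):
    """Strip surrounding whitespace and drop the empty lines OCR output often has."""
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def ocr_cropped_page(pdf_path, page_number=1, dpi=300):
    """Render one page of a cropped PDF to an image and run Tesseract on it."""
    # Imported here so clean_ocr_text can be used without the OCR dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    # use_cropbox=True asks poppler to render the crop box, not the media box,
    # so only the visible (cropped) region ends up in the image.
    images = convert_from_path(
        pdf_path,
        dpi=dpi,
        first_page=page_number,
        last_page=page_number,
        use_cropbox=True,
    )
    return clean_ocr_text(pytesseract.image_to_string(images[0]))


# Usage (assumes out8.pdf from the question is in the working directory):
# print(ocr_cropped_page("out8.pdf"))
```

Because the OCR engine only sees the rendered pixels, text outside the crop box cannot leak into the result, whatever the text layer contains.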
