Bharath M Shetty

Reputation: 30605

Content of whole page is still present even after the pdf file is cropped

I cropped a PDF file with PyPDF2, but when I try to extract text from the cropped file, I get the text of the whole page. How can I resolve that?

After cropping, the PDF file looks like this: [image]

But when I run the command pdftotext out8.pdf out.txt

I get:

Contents Introduction Part I. Two Systems

  1. The Characters of the Story

  2. Attention and Effort

  3. The Lazy Controller

  4. The Associative Machine

  5. Cognitive Ease
  6. Norms, Surprises, and Causes
  7. A Machine for Jumping to Conclusions
  8. How Judgments Happen
  9. Answering an Easier Question Part II. Heuristics and Biases
  10. The Law of Small Numbers <5>
  11. Anchors
  12. The Science of Availability
  13. Availability, Emotion, and Risk
  14. Tom W’s Specialty

The output was supposed to be only

  1. The Characters of the Story

The code I ran:

from PyPDF2 import PdfFileWriter, PdfFileReader

input1 = PdfFileReader(open("./data/in2.pdf", "rb"))
output = PdfFileWriter()

# Second page of the document (page indices start at 0)
page = input1.getPage(1)

# Crop to a 252 x 150 pt region; coordinates are in points from the bottom-left corner
page.cropBox.lowerLeft = (0, 331 - 150)
page.cropBox.upperRight = (252, 331)
output.addPage(page)

with open("out8.pdf", "wb") as outputStream:
    output.write(outputStream)

Upvotes: 2

Views: 1563

Answers (1)

Alan

Reputation: 3042

It sounds like pdftotext is extracting the text from the text layer. PDFs can have more than one layer: a purely image-based PDF has only the image layer, but many have an image layer together with a text layer. The text layer can be in front of the image, behind the image, or not visible.

Unless the PDF has been prepared in a special way, the text layer does not align with the text seen in the image. If you have a multipage PDF, then the text may be split into the relevant pages but otherwise not arranged across the page.

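When you crop the page, this does not affect the text layer. Setting the crop box only changes the region that viewers display; the page's content is not removed. When you extract the text, it is taken from the text layer, which is still intact.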
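To make that concrete, here is a minimal sketch of what honouring the crop box would involve: keeping only the text whose position falls inside the rectangle. The crop box values come from the question; the fragment positions and the helper names `inside_box` and `visible_text` are invented for illustration, and a real extractor would have to supply the actual coordinates.

```python
def inside_box(x, y, box):
    """Return True if point (x, y) lies inside box = (left, bottom, right, top)."""
    left, bottom, right, top = box
    return left <= x <= right and bottom <= y <= top


def visible_text(fragments, box):
    """Keep only the fragments whose anchor point falls inside the crop box."""
    return [text for x, y, text in fragments if inside_box(x, y, box)]


# Crop box from the question: lower-left (0, 181), upper-right (252, 331), in points.
crop_box = (0, 331 - 150, 252, 331)

# Hypothetical fragment positions, made up for this example only.
fragments = [
    (40, 300, "1. The Characters of the Story"),
    (40, 150, "2. Attention and Effort"),
    (40, 120, "3. The Lazy Controller"),
]

print(visible_text(fragments, crop_box))
```

A plain text extraction skips this filtering step, so every fragment on the page comes back regardless of the crop box.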

To get just the text of the cropped section, you'll need to process it through an OCR engine, e.g. Tesseract. Examples of Python packages that interface with Tesseract: pytesseract and tesserocr.
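A rough sketch of that OCR route with pytesseract might look like the following. It assumes the pdf2image package (not mentioned above) is used to render the page, and that the Poppler and Tesseract binaries are installed; the function names `ocr_cropped_pdf` and `clean_ocr_text` are hypothetical.

```python
def clean_ocr_text(text):
    """Collapse the stray line breaks and runs of spaces that OCR output often contains."""
    return " ".join(text.split())


def ocr_cropped_pdf(pdf_path, dpi=300):
    """Render each page's crop box to an image and OCR it; returns one string per page."""
    # Imported here so clean_ocr_text works even without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # use_cropbox=True asks the renderer to use the crop box instead of the media box.
    images = convert_from_path(pdf_path, dpi=dpi, use_cropbox=True)
    return [clean_ocr_text(pytesseract.image_to_string(image)) for image in images]
```

Calling `ocr_cropped_pdf("out8.pdf")` on the cropped file from the question should then return only the text that is visible inside the crop box, subject to OCR accuracy.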

Some guides on how to set it up and run the processing:

ocr-on-pdf-files-using-python
ocr-python-easy

Upvotes: 3
