PDFBox - 2.0.3 - PDFTextStripper picking up old text from page prior to cropping/rotating

Question

I'm attempting to perform some string validation against individual PDF pages in a file via the use of Apache PDFBox.

I'm going to be utilizing PDFTextStripper for the majority of this, so my first issue to tackle was the fact that all the PDFs i'm going to be validating against are generated as 2up; e.g Page 1 of 2 and page 2 of 2 were on the same page or if you imagine you literally scanned a book face down into a scanner - In addition to this, they were oriented incorrectly, and needed rotating 90 degrees so PDFTextStripper could read them properly.

Using elements of the below questions/solutions, i have built a method which first crops the page exactly in half, exports the cropped pages in order to a new file, rotates each page to the correct orientation and then saves the file;

Rotate PDF around its center using PDFBox in java

Split a PDF page in two parts [duplicate]

Visually, my method is seemingly working as expected until i run PDFTextStripper against it - It appears to be returning the text of not just the page i want, but also the page i cropped out of it.

To confirm the issue, I extracted a single page out of the entire document and saved it as a new file - when running PDFTextStripper, i still get the same results even though all i can see is literally one page. Adobe search doesn't bring up the hidden, legacy data either.

I can only assume that during my transform method, i need to redefine the cropped page with only the contents of the cropped page.

My question is, how can i do this?

p.s - i haven't posted my code as it's basically a amalgamation of the solutions provided in the aforementioned links above - however if it i needed, i can provide

mkl · Accepted Answer

The PDFTextStripper ignores the CropBox you set to crop the pages. It also ignores whether text is covered by some filled rectangle or image or whether the text is invisible, it extracts all text (except text in patterns or contains in Type 3 font characters).

You might want to try the PDFTextStripperByArea instead. This class (which is derived from PDFTextStripper) restricts itself to regions you can define.

(Unfortunately these regions have to be defined using a different coordinate system than the one used for the CropBox, so usually you will have to transform the coordinates first.)

PDFBox - 2.0.3 - PDFTextStripper picking up old text from page prior to cropping/rotating

Answers (1)

Related Questions