Reputation: 4966
I am having issues with coordinates. The PDFTextStripperByArea region seems to be pushed too high.
Consider the following example snippet:
...
PDPage page = (PDPage) allPages.get(0);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// define region for extraction -- the coordinates and dimensions are x, y, width, height
Rectangle2D.Float region = new Rectangle2D.Float(x, y, width, height);
stripper.addRegion("test region", region);
// overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
contentStream.setNonStrokingColor( Color.CYAN );
contentStream.fillRect(x, y, width, height );
contentStream.close();
// extract the text from the defined region
stripper.extractRegions(page);
String content = stripper.getTextForRegion("test region");
...
document.save(...); ...
The cyan rectangle overlays the desired region nicely. The stripper, on the other hand, misses a couple of lines at the bottom of the rectangle and includes a couple of lines above it -- it looks like the region is shifted "upwards" (along the y coordinate). What is going on?
Upvotes: 4
Views: 3359
Reputation: 304
As Christian said in his comment, the problem is that the coordinate system for the fillRect() method and the one for the PDFTextStripperByArea are different.
The first expects the origin to be the lower-left corner of the page, while the second expects it to be the upper-left.
So, to make it work, change the region given to the PDFTextStripperByArea to:
Rectangle2D.Float region = new Rectangle2D.Float(x, ph - y - height, width, height);
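where ph is the page height (getUpperRightY() equals the height only when the media box's lower-left corner sits at the origin, which is the usual case; otherwise getHeight() is the safer call):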
float ph = page.getMediaBox().getUpperRightY();
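To make the flip explicit, here is a minimal sketch of the same conversion as a standalone helper. The class and method names (RegionCoordinates, toStripperRegion) are made up for illustration, and the page height is passed in as a plain float so the snippet runs without PDFBox on the classpath; the 792-point height is just an assumed US Letter page.

```java
import java.awt.geom.Rectangle2D;

final class RegionCoordinates {

    // Converts a rectangle given with a lower-left origin (y grows upwards,
    // as used by fillRect) into the upper-left-origin form that the answer
    // passes to PDFTextStripperByArea. Hypothetical helper, not a PDFBox API.
    static Rectangle2D.Float toStripperRegion(float x, float y, float width, float height, float pageHeight) {
        // The top edge of the rectangle is at y + height from the bottom,
        // so its distance from the top of the page is pageHeight - (y + height).
        return new Rectangle2D.Float(x, pageHeight - y - height, width, height);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumption: US Letter height in points
        Rectangle2D.Float region = toStripperRegion(50f, 100f, 200f, 150f, pageHeight);
        System.out.println(region.x + ", " + region.y + ", " + region.width + ", " + region.height);
    }
}
```

With these sample numbers the region's y becomes 792 - 100 - 150 = 542, while x, width and height stay unchanged.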
PS: I know this is a very old question, but Google brought me here when I faced the same problem, so I will add my answer.
Upvotes: 4
Reputation: 4966
Text is usually contained inside a positioning rectangle. Sometimes the text is not at the expected position inside that rectangle, and PDFBox uses the rectangle to guess where the text is located. So if the text box starts outside the capture area and the text flows into it, the text might not be extracted.
Rough sketch: the text box starts outside the capture area, but the text flows into it. It might not be captured.
 ____________
|Page        |
|    _______ |
|   |Area   ||
|   |       ||
| ..|.....  ||
| ⁞ |Text⁞  ||
| ⁞ |____⁞__||
| ⁞......⁞   |
|____________|
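The situation in the sketch can be reproduced with plain java.awt geometry. This is only an illustration under an assumption: that the decision is made from a single reference point of the text box rather than from its overlap with the area. The class and method names (RegionHitTest, capturedByReferencePoint) are hypothetical and not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

final class RegionHitTest {

    // Assumed rule for illustration: text counts as "in the region" only if
    // its single reference point lies inside the region.
    static boolean capturedByReferencePoint(Rectangle2D region, double refX, double refY) {
        return region.contains(refX, refY);
    }

    public static void main(String[] args) {
        // Capture area and a text box that starts to the left of it.
        Rectangle2D region = new Rectangle2D.Float(100f, 100f, 200f, 100f);
        Rectangle2D textBox = new Rectangle2D.Float(60f, 150f, 150f, 80f);

        // The two rectangles overlap, so part of the text is visually inside the area.
        System.out.println(textBox.intersects(region)); // prints "true"

        // The box's reference corner is outside the area, so a point test rejects it.
        System.out.println(capturedByReferencePoint(region, textBox.getX(), textBox.getY())); // prints "false"
    }
}
```

If that is what happens in your document, enlarging the region so that it also covers the start of the text box should bring the missing text back.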
Upvotes: 1