Reputation: 533
I'm trying to parse a journal page using PDFBox. Here's a snippet of the code I'm using:
try (PDDocument document = PDDocument.load(new File("myfile.pdf"))) {
    if (!document.isEncrypted()) {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        PDFTextStripper tStripper = new PDFTextStripper();
        tStripper.setParagraphEnd("\n");
        String pdfFileInText = tStripper.getText(document);
        String output = "";
        String lines[] = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            output += line + "\n";
        }
    }
}
The problem is that, although the paragraphs themselves come out fine, they appear in a seemingly random order. I need them in natural reading order (top to bottom, left to right), but PDFBox seems to jump from one side of the page to the other for no obvious reason. My original PDF also contains images at arbitrary positions, which I suspect might have something to do with this.
Here's a sample of the PDF that is not being read in order:
And here's what I get from that sample:
GALIZA>2-3
Analizamos os programas de PSOE, PP,
En Común-Unidas Podemos e do BNG
> Na Galiza hai case 15 librarías por cada
100.000 habitantes
> Só o 26% das persoas propietarias son
mulleres, fronte ao 74% de homes
A media de
traballadoras dunha
libraría e de 3,5
TRABALLO>15
Día das Librarías
As oito
medidas
electorais
para Galiza
Is there a way to get the paragraphs in natural order?
Upvotes: 3
Views: 1309
Reputation: 31724
Would this work for you?
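import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// try-with-resources closes the document when done
try (PDDocument document = PDDocument.load(new File("myfile.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    // sort the extracted text by its position on the page
    stripper.setSortByPosition(true);
    // page numbers in PDFTextStripper are 1-based
    for (int p = 1; p <= document.getNumberOfPages(); p++) {
        stripper.setStartPage(p);
        stripper.setEndPage(p);
        String text = stripper.getText(document);
        System.out.println(text);
    }
}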
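Maybe avoid PDFTextStripperByArea here. In your snippet, setSortByPosition(true) is called on the PDFTextStripperByArea instance, which is never used for extraction, so the PDFTextStripper that actually produces the text is not sorting by position. Just get the text with a plain PDFTextStripper and format it afterwards. Could you try this?
As I said in the comments, it is difficult to assess without looking at the PDF directly.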
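To make "natural order" concrete, here is a minimal sketch of a top-to-bottom, left-to-right sort in plain Java. It is only an illustration of the idea, not PDFBox's actual implementation: TextBlock and sortNatural are hypothetical names, the coordinates are made up, and y is assumed to grow downward from the top of the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {
    // Hypothetical stand-in for a positioned chunk of text; not a PDFBox class.
    static final class TextBlock {
        final String text;
        final float x; // distance from the left edge (assumed)
        final float y; // distance from the top edge (assumed)

        TextBlock(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns a copy sorted top-to-bottom, then left-to-right.
    // Blocks are grouped into rows by rounding y / tolerance, so small
    // vertical offsets within a row do not change the order.
    static List<TextBlock> sortNatural(List<TextBlock> blocks, float tolerance) {
        List<TextBlock> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator
                .comparingInt((TextBlock b) -> Math.round(b.y / tolerance))
                .thenComparingDouble(b -> b.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<TextBlock> blocks = new ArrayList<>();
        blocks.add(new TextBlock("right column", 300f, 100f));
        blocks.add(new TextBlock("headline", 50f, 20f));
        blocks.add(new TextBlock("left column", 50f, 102f));

        for (TextBlock b : sortNatural(blocks, 5f)) {
            System.out.println(b.text);
        }
    }
}
```

Note that a purely row-wise sort like this interleaves lines from different columns on a multi-column page such as a newspaper front page, so position sorting alone may still not give the order you expect for that kind of layout.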
Upvotes: 2