David Antelo

Reputation: 533

Read paragraphs in order in PDFBox

I'm trying to parse a journal page using PDFBox. Here's a snippet of the code I'm using:

try (PDDocument document = PDDocument.load(new File("myfile.pdf"))) {

    if (!document.isEncrypted()) {

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        PDFTextStripper tStripper = new PDFTextStripper();
        tStripper.setParagraphEnd("\n");

        String pdfFileInText = tStripper.getText(document);

        String output = "";

        String lines[] = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            output += line + "\n";
        }

    }

}

The problem is, even though the paragraphs I get are ok, they show up in a completely random order. I need to get the paragraphs in natural order (top-bottom, left-right), but PDFBox seems to jump from one side of the page to the other for no real reason. My original PDF file also contains images at random positions, which I'm thinking might have something to do with this.

Here's a sample of the PDF that is not being read in order:

And here's what I get from that sample:

GALIZA>2-3
Analizamos os programas de PSOE, PP, 
En Común-Unidas Podemos e do BNG

> Na Galiza hai case 15 librarías por cada 
100.000 habitantes

> Só o 26% das persoas propietarias son 
mulleres, fronte ao 74% de homes 

A media de 
traballadoras dunha 
libraría e de 3,5

TRABALLO>15
Día das Librarías

As oito 
medidas 
electorais 
para Galiza

Is there a way to get the paragraphs in natural order?

Upvotes: 3

Views: 1309

Answers (1)

Jatin

Reputation: 31724

Would this work for you?

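        // Needs: java.io.File, org.apache.pdfbox.pdmodel.PDDocument,
        //        org.apache.pdfbox.text.PDFTextStripper
        try (PDDocument document = PDDocument.load(new File("myfile.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // Sort text by its position on the page rather than content-stream order
            stripper.setSortByPosition(true);

            // PDFTextStripper page numbers are 1-based and the end page is inclusive
            for (int p = 1; p <= document.getNumberOfPages(); p++) {
                stripper.setStartPage(p);
                stripper.setEndPage(p);
                String text = stripper.getText(document);
                System.out.println(text);
            }
        }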

Maybe avoid PDFTextStripperByArea, which uses heuristics: just get the text with a plain PDFTextStripper that has setSortByPosition(true) set, and format it afterwards. Could you try this?

As I said in the comments, it is difficult to assess without looking at the PDF directly.
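If position sorting alone still does not give the order you want, one option is to collect the text blocks together with their coordinates and order them yourself. The following is only a minimal sketch of that idea and is not part of PDFBox: the ReadingOrder and Block names are hypothetical, the y coordinate is assumed to grow downwards, and the coordinates are assumed to have been gathered already (for example from the TextPosition objects that PDFTextStripper hands to writeString). Blocks whose y values lie within a tolerance of the first block in a row are treated as the same row, rows go top to bottom, and each row is read left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {

    /** Hypothetical holder for a text block and its top-left corner (y grows downwards). */
    static final class Block {
        final String text;
        final double x;
        final double y;

        Block(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a new list ordered top to bottom, then left to right within each row. */
    static List<Block> sort(List<Block> blocks, double yTolerance) {
        List<Block> byY = new ArrayList<>(blocks);
        byY.sort(Comparator.comparingDouble((Block b) -> b.y));

        List<Block> result = new ArrayList<>();
        List<Block> row = new ArrayList<>();
        for (Block b : byY) {
            // Start a new row once this block is too far below the row's first block
            if (!row.isEmpty() && b.y - row.get(0).y > yTolerance) {
                flush(row, result);
            }
            row.add(b);
        }
        flush(row, result);
        return result;
    }

    /** Sorts the current row left to right, appends it to the output, and clears it. */
    private static void flush(List<Block> row, List<Block> out) {
        row.sort(Comparator.comparingDouble((Block b) -> b.x));
        out.addAll(row);
        row.clear();
    }

    public static void main(String[] args) {
        // Made-up coordinates, purely for illustration
        List<Block> blocks = Arrays.asList(
                new Block("right column", 300, 101),
                new Block("second row", 50, 200),
                new Block("left column", 50, 100));
        for (Block b : sort(blocks, 5.0)) {
            System.out.println(b.text);
        }
    }
}
```

The tolerance is a judgment call: too small and text on the same visual line splits into separate rows, too large and neighbouring lines merge. For a true multi-column layout you would probably want to split the blocks by column first and apply this ordering inside each column.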

Upvotes: 2
