Read paragraphs in order in PDFBox

Question

I'm trying to parse a journal page using PDFBox. Here's a snippet of the code I'm using:

try (PDDocument document = PDDocument.load(new File("myfile.pdf"))) {

    if (!document.isEncrypted()) {

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        PDFTextStripper tStripper = new PDFTextStripper();
        tStripper.setParagraphEnd("\n");

        String pdfFileInText = tStripper.getText(document);

        String output = "";

        String lines[] = pdfFileInText.split("\r?\n");
        for (String line : lines) {
            output += line + "\n";
        }

    }

}

The problem is that, although the paragraphs themselves come out fine, they appear in a seemingly random order. I need them in natural reading order (top to bottom, left to right), but PDFBox seems to jump from one side of the page to the other for no apparent reason. The original PDF also contains images at arbitrary positions, which I suspect might have something to do with this.

Here's a sample of the PDF that is not being read in order:

And here's what I get from that sample:

GALIZA>2-3
Analizamos os programas de PSOE, PP, 
En Común-Unidas Podemos e do BNG

> Na Galiza hai case 15 librarías por cada 
100.000 habitantes

> Só o 26% das persoas propietarias son 
mulleres, fronte ao 74% de homes 

A media de 
traballadoras dunha 
libraría e de 3,5

TRABALLO>15
Día das Librarías

As oito 
medidas 
electorais 
para Galiza

Is there a way to get the paragraphs in natural order?
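
One detail in the snippet above is worth noting: setSortByPosition(true) is called on the PDFTextStripperByArea instance, which is never used for extraction, while the tStripper that actually produces the text keeps the default setting. Calling tStripper.setSortByPosition(true) is probably the first thing to try. If that is not enough, the ordering can be done by hand. The following is only a minimal sketch of what "top to bottom, left to right" could mean, assuming each paragraph has already been collected together with the top-left corner of its bounding box, with y growing downward. ReadingOrder, Block and rowHeight are hypothetical names and not part of PDFBox; in practice the coordinates might be gathered by subclassing PDFTextStripper and reading the TextPosition objects passed to writeString.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: orders text blocks top to bottom, then left to right. */
final class ReadingOrder {

    /** A paragraph plus the top-left corner of its bounding box (assumption: y grows downward). */
    static final class Block {
        final String text;
        final float x;
        final float y;

        Block(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Returns a sorted copy of the input. Blocks whose y values fall into the same
     * horizontal band of height rowHeight are treated as one row and ordered by x.
     */
    static List<Block> sort(List<Block> blocks, float rowHeight) {
        if (rowHeight <= 0) {
            throw new IllegalArgumentException("rowHeight must be positive");
        }
        List<Block> sorted = new ArrayList<>(blocks);
        sorted.sort(Comparator
                // Primary key: index of the horizontal band the block starts in.
                .comparingInt((Block b) -> (int) Math.floor(b.y / rowHeight))
                // Secondary key: horizontal position within that band.
                .thenComparingDouble(b -> b.x));
        return sorted;
    }
}
```

Banding by a fixed rowHeight keeps the comparator a consistent total order, at the cost that two blocks sitting just either side of a band boundary are placed in different rows. A multi-column layout like the sample page would probably need a column-detection step before this sort, since a plain top-to-bottom ordering interleaves the columns.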
