Extracting exact table data from PDF

Question

I am trying to extract each row of a table from a PDF file that I created earlier.

The problem is that empty cells, which I expected to be stored as 'null', are skipped entirely; they are not even read as space characters.

I extract the content from my PDF via this method:

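    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public final ArrayList<String> extractLines(final File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper strip = new PDFTextStripper();
            String txt = strip.getText(doc);
            // Split on line breaks (also handles Windows "\r\n" line endings).
            String[] arr = txt.split("\\r?\\n");
            return new ArrayList<>(Arrays.asList(arr));
        }
    }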

Is it even possible to extract the data with the whitespace preserved?

If so, can it be done with PDFBox, or do I need a different approach?

EDIT:

I cannot get traprange to work. A simple test:

    File e = new File("C:/Users/Test/Downloads/a.pdf");
    List t = new PDFTableExtractor().setSource(e).extract();
    System.out.println(t.get(0).toString());

Only gives me:

Could it have to do with the form of my table?
My table:

Dahlin · Accepted Answer

I came up with my own solution.

Since I have a 2D ArrayList, each inner list holds one row of the table.

For every row I save the position of the non-empty cell (only one cell per row is non-empty at any time).

I store these positions in a metadata field of the PDF and read that field back to recover them.
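
A minimal sketch of that idea, not my exact code: the class `CellPositions` and its methods `encode` and `decode` are hypothetical names made up for illustration, and the comma-separated format is just one possible choice. It only shows how the positions can be turned into a string and back; the string itself could then be written to and read from the PDF with PDFBox's `PDDocumentInformation.setCustomMetadataValue` and `getCustomMetadataValue`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not part of PDFBox.
final class CellPositions {

    // For each row, record the column index of its first non-empty cell
    // (-1 if the whole row is empty), joined with commas, e.g. "1,0,2".
    static String encode(List<List<String>> table) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : table) {
            int col = -1;
            for (int i = 0; i < row.size(); i++) {
                String cell = row.get(i);
                if (cell != null && !cell.isEmpty()) {
                    col = i;
                    break;
                }
            }
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(col);
        }
        return sb.toString();
    }

    // Rebuild the rows: each extracted value goes back into its recorded
    // column and every other cell is null. Rows recorded as -1 consume no value.
    static List<List<String>> decode(String encoded, List<String> values, int columns) {
        List<List<String>> table = new ArrayList<>();
        if (encoded.isEmpty()) {
            return table;
        }
        int next = 0; // index of the next extracted value to place
        for (String part : encoded.split(",")) {
            int col = Integer.parseInt(part);
            List<String> row = new ArrayList<>();
            for (int c = 0; c < columns; c++) {
                row.add(null);
            }
            if (col >= 0) {
                row.set(col, values.get(next++));
            }
            table.add(row);
        }
        return table;
    }
}
```

When writing the PDF, the result of `encode` would be stored under a custom key of your choosing; when reading, the lines returned by the text stripper are passed to `decode` as `values` together with the stored string and the number of columns. This assumes the stripper returns exactly one line per non-empty row, which may not hold for every layout.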
