Reputation: 177
I am trying to extract each row of a table from a PDF file I created earlier.
The problem is that empty cells, which I expected to be stored as 'null', are skipped entirely; they are not even read as space characters.
I extract the content from my PDF via this method:
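    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public final ArrayList<String> extractLines(final File pdf) throws IOException {
        try (PDDocument doc = PDDocument.load(pdf)) {
            PDFTextStripper strip = new PDFTextStripper();
            String txt = strip.getText(doc);
            // \R matches any line break, so no stray '\r' is left behind on Windows
            String[] arr = txt.split("\\R");
            return new ArrayList<>(Arrays.asList(arr));
        }
    }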
Is it even possible to extract the data with the whitespace preserved?
If so, with PDFBox? Or a different method?
EDIT:
I cannot get traprange to work. A simple test:
    File e = new File("C:/Users/Test/Downloads/a.pdf");
    List<Table> t = new PDFTableExtractor().setSource(e).extract();
    System.out.println(t.get(0).toString());
Only gives me:
Could it have to do with the form of my table?
My table:
Upvotes: 0
Views: 2660
Reputation: 177
I came up with my own solution.
Since I have a 2D ArrayList, each inner list holds one row of the table.
I now save the position of the non-empty cell in each row (only one cell per row is ever non-empty).
I store these positions in a metadata field of the PDF and read that field back when loading to restore them.
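For anyone wanting to try the same idea, here is a minimal sketch of the encoding step only, using nothing but the JDK. The class name CellPositions, the comma-separated format, and the use of -1 for a fully empty row are assumptions made for illustration, not the code actually used above. With PDFBox 2.x the resulting string could probably be stored with doc.getDocumentInformation().setCustomMetadataValue(key, value) and read back with getCustomMetadataValue(key), but the post does not say which metadata field was used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: names and string format are illustrative only.
final class CellPositions {

    /** Encodes, per row, the column index of the first non-empty cell (-1 if none), e.g. "2,0,-1". */
    static String encode(List<List<String>> rows) {
        List<String> indices = new ArrayList<>();
        for (List<String> row : rows) {
            int found = -1;
            for (int col = 0; col < row.size(); col++) {
                String cell = row.get(col);
                if (cell != null && !cell.isEmpty()) {
                    found = col;
                    break;
                }
            }
            indices.add(Integer.toString(found));
        }
        return String.join(",", indices);
    }

    /**
     * Rebuilds the rows from the encoded positions and the extracted cell texts.
     * Assumes 'values' holds one text per non-empty row, in row order.
     */
    static List<List<String>> decode(String encoded, List<String> values, int columnCount) {
        List<List<String>> rows = new ArrayList<>();
        if (encoded.isEmpty()) {
            return rows;
        }
        int next = 0;
        for (String part : encoded.split(",")) {
            int col = Integer.parseInt(part);
            // Start with a row of nulls, then fill in the one known cell.
            List<String> row = new ArrayList<>(Collections.<String>nCopies(columnCount, null));
            if (col >= 0) {
                row.set(col, values.get(next++));
            }
            rows.add(row);
        }
        return rows;
    }
}
```

The decode step relies on the extracted lines arriving in the same order as the rows, which holds only if the text stripper returns one line per non-empty row.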
Upvotes: 0
Reputation: 1202
This needs a custom algorithm. Please check this solution for a custom PDFTableStripper.
Another great solution has been implemented by Tho, which can be found at traprange. It can extract empty (null) cells as well.
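To give a rough idea of what such a custom algorithm can look like, here is a small sketch of the column-binning step in plain Java. It is not the code of PDFTableStripper or traprange. The ColumnBinner and Fragment names are hypothetical, and the column left edges are assumed to be known in advance. In a real PDFBox 2.x setup the x coordinates would typically come from TextPosition.getXDirAdj() inside an overridden PDFTextStripper.writeString(String, List<TextPosition>).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: assigns text fragments to table columns by x coordinate.
final class ColumnBinner {

    /** A piece of text together with the x coordinate of its left edge. */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Builds one table row from the fragments found on a single text line.
     * columnStarts must hold the left edge of each column in ascending order.
     * Columns that receive no fragment stay null, so empty cells are kept.
     */
    static List<String> toRow(List<Fragment> fragments, double[] columnStarts) {
        List<String> row = new ArrayList<>(Collections.<String>nCopies(columnStarts.length, null));
        if (columnStarts.length == 0) {
            return row;
        }
        for (Fragment f : fragments) {
            int col = 0;
            // Pick the last column whose left edge is at or before the fragment.
            for (int i = 0; i < columnStarts.length; i++) {
                if (f.x >= columnStarts[i]) {
                    col = i;
                }
            }
            String existing = row.get(col);
            row.set(col, existing == null ? f.text : existing + " " + f.text);
        }
        return row;
    }
}
```

Because every row is pre-filled with nulls, a cell that has no text in the PDF still shows up in the result, which is exactly what plain getText() loses.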
Upvotes: 2