Reputation: 344
I need to extract data from similarly formatted tables from this file. There are some OCR errors but I have an automated method to correct them.
I have tried:
The Problem: The commercial tools are very bad at detecting the edges of the table. The tables follow a similar general format, but each scan is aligned slightly differently, so hard-coding the borders won't work either.
Question: Do you guys know a good way to detect where the table begins and then apply one of a few templates?
Any other tips for this kind of work are greatly appreciated.
Upvotes: 1
Views: 2183
Reputation: 344
UPDATE 2/26: I solved my own question, though feel free to respond with faster or better solutions.
One of the main problems is that the tables are roughly similar in their dimensions but they vary from page to page. The scanned images are also slightly offset from page to page, giving two alignment problems. My current workflow solves both and is as follows.
Solution:
The images of the same table type are still not aligned, so specifying a table layout in (x,y) coordinates won't work. The table locations are different in each image.
I needed to align the images based on the table location, but without already detecting the table there was no good way to do that.
I solved the problem in an interesting way, but I tried the following steps first.
Solution:
After cutting the images into tables as explained in the Table Type Alignment section, use the Auto-Align Layers feature in Photoshop to align the images.
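For anyone without Photoshop, here is a rough sketch of a different technique, not what I used: estimating the offset between two scans with phase correlation in NumPy. It assumes both images are grayscale arrays of the same shape and differ only by a translation (no rotation or scaling), and the function name estimate_shift is just something made up for illustration. Real scans with different cell contents will only match approximately, so treat the result as a starting guess.

```python
import numpy as np


def estimate_shift(reference, moving):
    """Estimate the (dy, dx) roll that best aligns `moving` with `reference`.

    Both inputs are 2-D arrays of identical shape. The result is chosen so that
    np.roll(moving, (dy, dx), axis=(0, 1)) approximates `reference`.
    """
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)

    # Normalised cross-power spectrum: keeps only the phase difference.
    cross = f_ref * np.conj(f_mov)
    cross /= np.abs(cross) + 1e-12

    # The inverse transform peaks at the translation between the two images.
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)

    # Peaks past the midpoint correspond to negative (wrapped-around) shifts.
    shifts = [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]
    return tuple(int(s) for s in shifts)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    page = rng.random((64, 64))
    # Simulate a scan that is offset by 3 rows down and 5 columns left.
    offset_page = np.roll(page, (3, -5), axis=(0, 1))
    dy, dx = estimate_shift(page, offset_page)
    realigned = np.roll(offset_page, (dy, dx), axis=(0, 1))
    print(dy, dx, np.allclose(realigned, page))
```

np.roll wraps pixels around the edges, which is fine for a demo but on real pages you would probably pad or crop instead.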
Step-by-Step Solution:
Done! Combine the files for each table however you like. I will post my Python code for doing this when I'm done with the project. Once cleaned, I will post the data too.
Upvotes: 4
Reputation: 3536
Instead of Camelot's table_areas parameter (which specifies fixed boundaries), you can try the table_regions parameter to specify the regions where the tables probably are (Camelot will only analyze the specified regions when looking for tables).
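A minimal sketch of what that call could look like. The helper names to_region and extract_tables are made up for illustration; the region string format "x1,y1,x2,y2" (top-left, then bottom-right, in PDF coordinate space) follows the Camelot documentation linked below. Camelot is imported inside the function so the formatting helper can be tried without it installed.

```python
def to_region(x1, y1, x2, y2):
    """Format one region as the "x1,y1,x2,y2" string Camelot expects.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    in PDF coordinate space.
    """
    return f"{x1},{y1},{x2},{y2}"


def extract_tables(pdf_path, regions, pages="1"):
    """Look for tables only inside the given rough regions of a PDF.

    `regions` is a list of (x1, y1, x2, y2) tuples. They only need to
    contain the table loosely, unlike table_areas, which fixes the exact
    boundaries.
    """
    import camelot  # assumes the camelot-py package is installed

    region_strings = [to_region(*r) for r in regions]
    return camelot.read_pdf(pdf_path, pages=pages, table_regions=region_strings)
```

For example, extract_tables("scan.pdf", [(170, 370, 560, 270)]) would search one region on page 1; the file name and coordinates here are placeholders, so substitute values measured from your own pages.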
https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions
Please keep us updated.
Upvotes: 0
Reputation: 406
There is a free online tool here https://www.pdftron.com/pdf-tools/pdf-table-extraction/
The related blog post, https://www.pdftron.com/blog/parsing-extraction/table-extraction-and-pdf-to-xml-with-pdfgenie/, references the PDFGenie command-line tool.
Upvotes: 1