Reputation: 136865
PDF is very nice for humans to read, but it is pretty awful to extract the data from. There are tons of tools to extract the data from PDF (pdftotext from poppler, pdftohtml, XPdf, tabula, a-pdf, ...).
As you can see in questions like this, those tools are not optimal.
It would be better if the PDF already contained the data in a structured way, ready to be extracted. Something like a stripped-down version of HTML. Especially for tables, a lot of information is lost, for example when you convert a Word document to PDF and then to text.
Does the PDF standard provide a way to store the structure of a table? If not, is it possible to extend the PDF standard? What would be the process for that?
Upvotes: 1
Views: 77
Reputation: 96064
What you are looking for are most likely tagged PDFs.
Tagged PDFs are specified in ISO 32000-1, section 14.8. They mark content parts as paragraphs, headers, lists (and list items), tables (and table rows, headers, and data cells) etc. with assorted attributes.
To do so they make use of the PDF logical structure facilities (see ISO 32000-1, section 12.7) which in turn use the marked content operators (see ISO 32000-1, section 12.6) to tag pieces of content streams with IDs which are referenced from a structure tree object model outside the content streams.
In a tagged PDF you can walk that structure tree like an XML DOM and retrieve the associated text pieces by making use of the ID markers in the content.
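To make that traversal concrete, here is a schematic sketch in plain Python. The dictionaries stand in for the structure element dictionaries the spec describes (/S for the structure type, /K for the kids); the tree data and the `Text` key are made up for illustration — in a real PDF the kids bottom out in marked-content IDs that you would resolve against the content streams with a PDF library, not inline strings.

```python
# Schematic sketch of walking a tagged-PDF structure tree.
# Real structure elements are PDF dictionaries with /S (structure type)
# and /K (kids: child elements or marked-content IDs). Here plain dicts
# mimic that shape; the "Text" key is a hypothetical stand-in for text
# that a real reader would resolve via MCIDs in the content streams.

def walk(elem, out=None):
    """Depth-first traversal collecting (structure type, text) pairs."""
    if out is None:
        out = []
    out.append((elem["S"], elem.get("Text", "")))
    for kid in elem.get("K", []):
        walk(kid, out)
    return out

# A miniature tree using the ISO 32000-1 table tags (invented sample data):
tree = {
    "S": "Table",
    "K": [
        {"S": "TR", "K": [
            {"S": "TH", "Text": "Name"},
            {"S": "TH", "Text": "Qty"},
        ]},
        {"S": "TR", "K": [
            {"S": "TD", "Text": "Apples"},
            {"S": "TD", "Text": "3"},
        ]},
    ],
}

for tag, text in walk(tree):
    print(tag, text)
```

Because the table rows, header cells, and data cells are explicit elements, reconstructing the table is a straightforward tree walk rather than the geometric guesswork that plain text extraction has to do.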
For details please study the PDF specification ISO 32000-1 or its update ISO 32000-2.
Adobe shared a copy of ISO 32000-1 (merely replacing ISO headers and references), simply search the web for "PDF32000_2008". Currently it's located here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Upvotes: 1