Reputation: 136865
PDF is very nice for humans to read, but it is pretty awful to extract the data from. There are tons of tools to extract the data from PDF (pdftotext from poppler, pdftohtml, XPdf, tabula, a-pdf, ...).
As you can see in questions like this, those tools are not optimal.
It would be better if the PDF already contained the data in a structured way, ready to be extracted. Something like a stripped-down version of HTML. Especially for tables, a lot of information is lost, for example when you convert a Word document to PDF and then to text.
Does the PDF standard provide a way to store the structure of a table? If not, is it possible to extend the PDF standard? What would be the process for that?
Upvotes: 1
Views: 77
Reputation: 96064
What you are looking for are most likely tagged PDFs.
Tagged PDFs are specified in ISO 32000-1, section 14.8. They mark content parts as paragraphs, headers, lists (and list items), tables (and table rows, headers, and data cells) etc. with assorted attributes.
To do so they make use of the PDF logical structure facilities (see ISO 32000-1, section 12.7) which in turn use the marked content operators (see ISO 32000-1, section 12.6) to tag pieces of content streams with IDs which are referenced from a structure tree object model outside the content streams.
In a tagged PDF you can walk that structure tree like an XML DOM and retrieve the associated text pieces by making use of the ID markers in the content.
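To make that traversal concrete, here is a schematic sketch in plain Python. The dictionaries stand in for the structure element dictionaries the spec describes (/S for the structure type, /K for the kids); the tree data and the `Text` key are made up for illustration — in a real PDF the kids bottom out in marked-content IDs that you would resolve against the content streams with a PDF library, not inline strings.

```python
# Schematic sketch of walking a tagged-PDF structure tree.
# Real structure elements are PDF dictionaries with /S (structure type)
# and /K (kids: child elements or marked-content IDs). Here plain dicts
# mimic that shape; the "Text" key is a hypothetical stand-in for text
# that a real reader would resolve via MCIDs in the content streams.

def walk(elem, out=None):
    """Depth-first traversal collecting (structure type, text) pairs."""
    if out is None:
        out = []
    out.append((elem["S"], elem.get("Text", "")))
    for kid in elem.get("K", []):
        walk(kid, out)
    return out

# A miniature tree using the ISO 32000-1 table tags (invented sample data):
tree = {
    "S": "Table",
    "K": [
        {"S": "TR", "K": [
            {"S": "TH", "Text": "Name"},
            {"S": "TH", "Text": "Qty"},
        ]},
        {"S": "TR", "K": [
            {"S": "TD", "Text": "Apples"},
            {"S": "TD", "Text": "3"},
        ]},
    ],
}

for tag, text in walk(tree):
    print(tag, text)
```

Because the table rows, header cells, and data cells are explicit elements, reconstructing the table is a straightforward tree walk rather than the geometric guesswork that plain text extraction has to do.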
For details please study the PDF specification ISO 32000-1 or its update ISO 32000-2.
Adobe shared a copy of ISO 32000-1 (merely replacing ISO headers and references), simply search the web for "PDF32000_2008". Currently it's located here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Upvotes: 1