Marcel

Reputation: 6320

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

Does anyone have any suggestions on how to tackle this problem?

Upvotes: 42

Views: 136285

Answers (8)

Aditya Parashar

Reputation: 11

For this you can convert the PDF to Markdown. The Markdown text lets you capture the headings of the PDF's content. You can use 'Docling' for this purpose, as it aims to capture all of the data, including well-structured Markdown tables and image references in the Markdown text.

  • Converts any PDF document to JSON/Markdown/DocTags format.
  • Extracts metadata from the document, such as title, authors, references and language.
  • Headings are marked with '##', followed by the content under that heading.
  • OCR support is also present.
  • Integrates easily with LLM app / RAG frameworks.
  • Inbuilt Hierarchical Chunking is also supported.

Docling Official Documentation - https://github.com/DS4SD/docling

Alternatively, you can use PyMuPDF4LLM, which does mostly the same thing, although its reading order and table Markdown may not be as good as Docling's. It emits different heading levels instead of only '##'.

PyMuPDF4LLM Official Documentation - https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/
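
Once either tool has produced Markdown, the headings and paragraphs can be recovered with a small amount of parsing. The following is a minimal sketch using only the standard library; the function name `split_markdown_sections` and the sample text are hypothetical, and it assumes ATX-style headings ('#' to '######') with blank lines between paragraphs. It does not handle fenced code blocks or tables.

```python
import re

# Matches ATX-style Markdown headings such as "## Title" (1 to 6 '#' characters).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(markdown_text):
    """Group Markdown text into (level, heading, paragraphs) tuples."""
    sections = []
    current = {"level": 0, "heading": None, "lines": []}
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new heading closes the section collected so far.
            sections.append(current)
            current = {
                "level": len(match.group(1)),
                "heading": match.group(2),
                "lines": [],
            }
        else:
            current["lines"].append(line)
    sections.append(current)

    result = []
    for section in sections:
        # Blank lines separate paragraphs; wrapped lines are joined with spaces.
        body = "\n".join(section["lines"])
        paragraphs = [
            " ".join(chunk.split())
            for chunk in re.split(r"\n\s*\n", body)
            if chunk.strip()
        ]
        if section["heading"] is not None or paragraphs:
            result.append((section["level"], section["heading"], paragraphs))
    return result


sample = """# Report

Intro paragraph
continues here.

## Methods

First paragraph.

Second paragraph.
"""

for level, heading, paragraphs in split_markdown_sections(sample):
    print(level, heading, paragraphs)
```

Text that appears before the first heading is returned with a heading of `None` and a level of 0.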

Upvotes: 1

Varghese PK

Reputation: 21

As mentioned in the previous answers, PDFs aren't very easy to parse. However, if you have some additional information about the text that you want to parse, you can pull it off.

  1. If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.

  2. If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
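
As a rough illustration of the first idea, the sketch below sorts text fragments by coordinates and labels a fragment as a heading when it starts at a known horizontal position. Everything here is an assumption for the sake of the example: the `Fragment` type, the `label_by_position` function, the sample coordinates, and a coordinate system whose y values grow downward from the top of the page. The fragments themselves would come from whichever parser you use.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    page: int
    x: float  # left edge of the text fragment
    y: float  # top edge; assumed to grow downward from the top of the page
    text: str


def in_reading_order(fragments):
    """Sort fragments by page, then top to bottom, then left to right."""
    return sorted(fragments, key=lambda f: (f.page, f.y, f.x))


def label_by_position(fragments, heading_x, tolerance=1.0):
    """Label a fragment as a heading when it starts at the known heading x."""
    labeled = []
    for frag in in_reading_order(fragments):
        is_heading = abs(frag.x - heading_x) <= tolerance
        labeled.append(("heading" if is_heading else "paragraph", frag.text))
    return labeled


# Hypothetical layout: headings start at x=50, body text is indented to x=72.
fragments = [
    Fragment(1, 72.0, 130.0, "Body text of the first section."),
    Fragment(1, 50.0, 100.0, "1. Introduction"),
    Fragment(2, 50.0, 90.0, "2. Methods"),
    Fragment(1, 72.0, 150.0, "More body text."),
]

for kind, text in label_by_position(fragments, heading_x=50.0):
    print(kind, text)
```

The same pattern works for the second idea: replace the x-position test with a check on the vertical gap before each fragment.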

PDFBox is a PDF parsing tool that you can use for extracting text and images, on top of which you can define your own custom parsing rules.

However, to parse PDFs you need some prior knowledge of the general format of the PDF file. You can check out the blog post "Document parsing" for more information on document parsing.

Disclaimer: I was involved in writing the blog post.

Upvotes: 1

Eric Kim

Reputation: 2696

PDF files can be parsed with tabula-py, or tabula-java.

I wrote a full tutorial on how to use tabula-py in this article. You can also use Tabula in a web browser, as long as you have Java installed.

Upvotes: 1

Vaibhav Panmand

Reputation: 397

Parsing a PDF for headers and their sub-contents is really difficult (which doesn't mean it's impossible), because PDFs come in various formats. I recently came across a tool named GROBID, which can help in this scenario. I know it's not perfect, but with proper training it can accomplish our goals.

GROBID is available as open source on GitHub.

https://github.com/kermitt2/grobid

Upvotes: 5

KIBOU Hassan

Reputation: 389

iText API: PdfReader pr = new PdfReader("C:\\test.pdf");

References: PDFReader

Upvotes: -12

Eugene

Reputation: 2878

You can use the following approach with iTextSharp or other open source libraries:

  • Read the PDF file with iTextSharp or a similar open source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
  • Sort all text objects by coordinates so that neighboring objects end up next to each other
  • Then iterate through the objects and check the distance between them to see whether 2 or more objects can be merged into one paragraph
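
The sort-and-merge steps can be sketched roughly as follows. This is a simplified illustration, not iTextSharp code: the function `merge_into_paragraphs`, the `(top, bottom, text)` tuple layout, and the `max_gap` threshold are all hypothetical, and it assumes single-column text on one page with y values that grow downward.

```python
def merge_into_paragraphs(lines, max_gap):
    """Merge text lines into paragraphs based on the vertical gap between them.

    lines: iterable of (top, bottom, text) tuples for one page.
    max_gap: largest gap between two lines that still counts as one paragraph.
    """
    # Sort by vertical position so neighboring lines are adjacent in the list.
    ordered = sorted(lines, key=lambda line: line[0])

    paragraphs = []
    previous_bottom = None
    for top, bottom, text in ordered:
        if previous_bottom is not None and top - previous_bottom <= max_gap:
            # Small gap: this line continues the current paragraph.
            paragraphs[-1].append(text)
        else:
            # Large gap (or first line): start a new paragraph.
            paragraphs.append([text])
        previous_bottom = bottom
    return [" ".join(parts) for parts in paragraphs]


# Hypothetical extracted lines, deliberately out of order.
lines = [
    (140.0, 152.0, "New paragraph."),
    (100.0, 112.0, "First line"),
    (114.0, 126.0, "second line."),
]

print(merge_into_paragraphs(lines, max_gap=4.0))
```

A suitable `max_gap` depends on the document; a value somewhat larger than the normal line spacing is a reasonable starting point.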

Alternatively, you can use a commercial tool such as ByteScout PDF Extractor SDK, which is capable of doing exactly this:

  • extract text and images while analyzing the layout of the text
  • export to XML or CSV, where text objects are merged or split into paragraphs inside a virtual layout grid
  • access objects via a special API that makes it possible to address each object by its "virtual" row and column index, regardless of how it is stored inside the original PDF.

Disclaimer: I am affiliated with ByteScout

Upvotes: 4

David van Driessche

Reputation: 7056

There is essentially no easy cut-and-paste solution, because PDF isn't really concerned with structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all text on the pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc.).

On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".

To complicate things even more, the order in which text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be the proper reading order).
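
To make the font-based heading detection concrete, here is a minimal sketch that treats the most common font size (weighted by character count) as body text and flags anything noticeably larger as a heading. The function `find_headings`, the `(text, font_size)` tuple layout, and the `ratio` threshold are hypothetical; the spans would come from whatever library you use to read the page content, and real documents usually need extra signals such as font weight or position.

```python
from collections import Counter


def find_headings(spans, ratio=1.2):
    """Return texts whose font size is at least `ratio` times the body size.

    spans: list of (text, font_size) tuples extracted from the PDF.
    """
    if not spans:
        return []

    # Weight each font size by character count so long body text dominates.
    weights = Counter()
    for text, size in spans:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]

    return [text for text, size in spans if size >= body_size * ratio]


# Hypothetical spans: two headings set larger than the 11pt body text.
spans = [
    ("Introduction", 18.0),
    ("This is the body text of the document, set in a smaller size.", 11.0),
    ("Details", 14.0),
    ("More body text follows here in the same size as before.", 11.0),
]

print(find_headings(spans))
```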

Upvotes: 26

markee174

Reputation:

Unless it uses Marked Content, a PDF does not have a structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/

Upvotes: 0
