Parsing tables in a PDF file for RAG

I'm working on an application like ChatPDF that uses an LLM with RAG. My problem is that I can't find a Python library that can parse a PDF file containing "complex" tables.

I have tried LlamaIndex (SimpleDirectoryReader) and the "unstructured" library, but both return only flat text, as follows:

SimpleDirectoryReader --- "Peripheral STM32L475Vx STM32L475Rx Flash memory 256KB 512KB 1MB 256KB 512KB 1MB"

unstructured --- "Peripheral STM32L475Vx STM32L475Rx Flash memory 256KB 512KB 1MB 256KB 512KB 1MB SRAM 128 KB"

These texts lose the structural relationship between products and parameters (e.g. STM32L475Vx belongs to the first "256KB 512KB 1MB").

Upvotes: -1

Answers (2)

Paul Matthews

Reputation: 41

LlamaParse is what you need; it converts PDFs to Markdown: https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/
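To illustrate why a table-aware text format helps here, below is a minimal sketch that is independent of LlamaParse. It assumes some parser has already returned the table as a list of rows, and renders those rows as a Markdown table so each value stays in its product column. The function name `rows_to_markdown` is hypothetical, and the sample rows are typed by hand from the question, not produced by any library.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Missing cells become empty strings; newlines and pipes would break the table.
        if cell is None:
            return ""
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]

    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


# Hand-typed sample based on the question; only the Flash memory row is shown.
sample_rows = [
    ["Peripheral", "STM32L475Vx", "STM32L475Rx"],
    ["Flash memory", "256KB 512KB 1MB", "256KB 512KB 1MB"],
]
markdown_table = rows_to_markdown(sample_rows)
print(markdown_table)
```

Chunking this text for embeddings keeps the column headers next to the values, which the flat strings in the question do not.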

Upvotes: 0

Eric Vaillancourt

Reputation: 81

I have been researching a project to analyze invoices and needed to retain information such as PO number, client name, address and product description.

I was able to do this in a test environment with this Azure Document Intelligence sample: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_analyze_receipts.py

I'm currently developing a class that will convert the results to a LangChain Document object for use downstream in a RAG application to create embeddings.

I will share when available.
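As a rough idea of what such a conversion might look like, here is a minimal sketch, not the class mentioned above. It assumes the analysis result exposes table cells with `row_index`, `column_index` and `content` fields; `Cell` and `TextDocument` are hypothetical plain-Python stand-ins (no Azure or LangChain import), so check the field names against the SDK version you use. The sample cells reuse the table from the question.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    # Hypothetical stand-in for a table cell returned by a document-analysis service.
    row_index: int
    column_index: int
    content: str


@dataclass
class TextDocument:
    # Hypothetical stand-in with the same shape as a text-plus-metadata document.
    page_content: str
    metadata: dict = field(default_factory=dict)


def cells_to_grid(cells, row_count, column_count):
    """Place each cell at its (row, column) position in a 2D list."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell.row_index][cell.column_index] = cell.content.strip()
    return grid


def grid_to_document(grid, source):
    """Emit one line per value, tagged with its row label and column header."""
    if not grid:
        return TextDocument("", {"source": source, "rows": 0})
    header, body = grid[0], grid[1:]
    lines = []
    for row in body:
        label = row[0]
        for name, value in zip(header[1:], row[1:]):
            if value:  # skip empty cells
                lines.append(f"{label} | {name}: {value}")
    return TextDocument("\n".join(lines), {"source": source, "rows": len(grid)})


# Hand-typed sample cells; "datasheet.pdf" is a placeholder file name.
sample_cells = [
    Cell(0, 0, "Peripheral"),
    Cell(0, 1, "STM32L475Vx"),
    Cell(0, 2, "STM32L475Rx"),
    Cell(1, 0, "Flash memory"),
    Cell(1, 1, "256KB 512KB 1MB"),
    Cell(1, 2, "256KB 512KB 1MB"),
]
doc = grid_to_document(cells_to_grid(sample_cells, 2, 3), "datasheet.pdf")
print(doc.page_content)
```

Each output line carries both the parameter name and the product it belongs to, so a chunk retrieved on its own still answers "which product has which value".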

Hope this helps!

Upvotes: 0
