Reputation: 2247

How to get consistent results in tabular PDF parsing with llama-parse?

I was parsing some PDF files with llama-parse in Python using the code below:

import os
import pandas as pd

import nest_asyncio 
nest_asyncio.apply()

os.environ["LLAMA_CLOUD_API_KEY"] = "some_key_id"
key_input = "some_key_id"

from llama_parse import LlamaParse

# running llama parsing
doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                        ).load_data(r"Path\myfile.pdf")

Parsing the same document with this same code now gives a different result than it did before. The difference is in the | characters and line separators that delimit the tabular text.

Is there a way to get the old results back from llama-parse, or to pin some parameters (such as the model) so that it always produces the same output? I want to build analytics on top of this, so the same code logic needs to see consistent results every time.

Last month's llama results:

print(doc_parsed[5].text[:1000])

# Information

|Name|: Mr. XXX|
|---|---|
|Age/Sex|: XX YRS/M|
|Lab Id.|: 0124080X|
|Refered By|: Self|
|Sample Collection On|: 03/Aug/2024 08:30AM|
|Collected By|: XXX|
|Sample Lab Rec. On|: 03/Aug/2024 11:50 AM|
|Collection Mode|: HOME COLLECTION|
|Reporting On|: 03/Aug/2024 02:48 PM|
|BarCode|: XXX|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|

Llama results on same PDF now:

print(doc_parsed[5].text[:1000])

# Report

Name: Mr. XXX

Age/Sex: XXX YRS/M

Lab Id: 0124080X

Referred By: Self

Sample Collection On: 03/Aug/2024 08:30 AM

Collected By: XXX

Sample Lab Rec. On: 03/Aug/2024 11:50 AM

Collection Mode: HOME COLLECTION

Reporting On: 03/Aug/2024 02:48 PM

BarCode: XXX

# Test Results

Test Name
Result
Biological Ref. Int.
Unit

Desired Results:

# Above part doesn't matter but Test Results should be separated by |
# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|

Has the model behind the service changed, causing the difference? Can I pin the model to get consistent results?

Upvotes: 1

Answers (2)

Inkyu Kim

Reputation: 155

I know it has been a while, but here is the approach I have settled on. I am assuming that you are doing this for RAG purposes, NOT for the sake of getting the parsing exactly right. Instead of trying to parse the PDF exactly, I do a rough parse that extracts all the text from the table, and then I replace the table with some metadata like {table id: 'werwrwe', summary: "contains the values of collection date, barcode etc", page: 12}. During retrieval, if the similarity search selects this chunk, I use the table id and page to go back to the original document and fetch that page, with the raw table, as a picture. This has been a much better RAG process for me than trying to get the parsing absolutely correct.
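
To make the idea concrete, here is a minimal sketch of the placeholder step, using only the standard library. The names TableRef, to_placeholder, swap_table and pages_to_render are hypothetical and not part of llama-parse or any RAG framework, and the placeholder format is just one possible choice. Rendering the page as a picture is left out because it depends on the PDF library you use.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """Metadata that stands in for a table inside a text chunk (hypothetical)."""
    table_id: str
    summary: str
    page: int


# Matches the placeholder tag written by to_placeholder().
TABLE_TAG = re.compile(r"\[table id=(?P<id>[\w-]+)\]")


def to_placeholder(ref: TableRef) -> str:
    """Render the metadata as a short line that is embedded instead of the table."""
    return f"[table id={ref.table_id}] {ref.summary} (page {ref.page})"


def swap_table(page_text: str, table_text: str, ref: TableRef) -> str:
    """Replace the roughly parsed table text with its placeholder line."""
    return page_text.replace(table_text, to_placeholder(ref))


def pages_to_render(chunk: str, registry: dict[str, TableRef]) -> list[int]:
    """Return the pages to fetch as images for every table tag found in a chunk."""
    return [
        registry[m.group("id")].page
        for m in TABLE_TAG.finditer(chunk)
        if m.group("id") in registry
    ]
```

At retrieval time, a chunk that contains a tag is passed to pages_to_render, and the returned page numbers tell you which pages of the original PDF to show instead of the parsed table.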

Upvotes: 0

ViSa

Reputation: 2247

Adding a parsing_instruction that asks for tabular data separated by | has helped produce the desired results, but I am not sure whether the results will stay consistent over time when relying on instructions.

Other answers are also welcome, and I am open to better ones.

# parsing instruction
parsingInstruction2 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the table data into four columns each seperated by |."""

# parse function
doc_parsed_13Sep2 = LlamaParse(result_type="markdown",api_key=key_input, 
                                  parsing_instruction=parsingInstruction2
                        ).load_data(r"Path\myfile.pdf")

output:

# Report

table {
width: 100%;
border-collapse: collapse;
}
th, td {
border: 1px solid black;
padding: 8px;
text-align: left;
}
th {
background-color: #f2f2f2;
}


Name: Mr. XXX

Age/Sex: XXX YRS/M

Lab Id: 0124080X

Referred By: Self

Sample Collection On: 03/Aug/2024 08:30AM

Collected By: XXX

Sample Lab Rec. On: 03/Aug/2024 11:50 AM

Collection Mode: HOME COLLECTION

Reporting On: 03/Aug/2024 03:24 PM

BarCode: XXX

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dl|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
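
Once the output is in this pipe-delimited form, the rows can be turned into records for analytics. The following is only a sketch under the assumption that the text after the "# Test Results" heading holds a single table whose first row is the header. section_after and parse_pipe_table are hypothetical helper names, not llama-parse functions.

```python
def section_after(markdown: str, heading: str) -> str:
    """Return the text that follows the first occurrence of a heading, or ''."""
    _, found, rest = markdown.partition(heading)
    return rest if found else ""


def parse_pipe_table(markdown: str) -> list[dict[str, str]]:
    """Parse one pipe-delimited table into a list of {column: value} records."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        # Only lines of the form |a|b|c| are treated as table rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the |---|---| separator row under the header.
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    # Rows whose cell count does not match the header are dropped.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]
```

For the output above, parse_pipe_table(section_after(doc_parsed_13Sep2[5].text, "# Test Results")) would give one dict per test, which can be passed straight to pd.DataFrame. If the parser falls back to the unseparated layout, the result is an empty list, which is easy to check for.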

UPDATE - Instruction changed to ask for fields separated by |:

parsingInstruction3 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the data with fields seperated by |."""

output:

# TEST REPORT

|Name|Mr. XXX|
|---|---|
|Age/Sex|XXX YRS/M|
|Lab Id.|0124080X|
|Referred By|Self|
|Sample Collection On|03/Aug/2024 08:30 AM|
|Collected By|XXX|
|Sample Lab Rec. On|03/Aug/2024 11:50 AM|
|Collection Mode|HOME COLLECTION|
|Reporting On|03/Aug/2024 03:24 PM|
|BarCode|XXX|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dL|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
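
Since I cannot confirm that instructions alone keep the output stable, one defensive option is to store the first accepted parse of each PDF and reuse it, so the analytics code always sees the same text for the same file. This is a sketch, not a llama-parse feature: cached_parse, EXPECTED_HEADER and the cache layout are assumptions of mine, and parse_fn is any function you supply that returns the parsed markdown for a path. It does not make the service deterministic; it only avoids re-parsing and fails loudly when a new parse lacks the expected header.

```python
import hashlib
from pathlib import Path
from typing import Callable

# Header row the downstream analytics code relies on (taken from the output above).
EXPECTED_HEADER = "|Test Name|Result|Biological Ref. Int.|Unit|"


def cached_parse(pdf_path, cache_dir, parse_fn: Callable[[Path], str]) -> str:
    """Return cached markdown for a PDF, parsing it only on the first call."""
    pdf_path, cache_dir = Path(pdf_path), Path(cache_dir)
    # Key the cache on the file content, so a renamed copy reuses the same entry.
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    text = parse_fn(pdf_path)
    # Refuse to cache a parse that lost the pipe-separated header row.
    if EXPECTED_HEADER not in text:
        raise ValueError(f"expected table header not found for {pdf_path.name}")

    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Here parse_fn would wrap the LlamaParse(...).load_data(...) call shown above and join the .text of the returned documents. The trade-off is that a cached file never benefits from later parser improvements unless you delete its cache entry.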

Upvotes: 0
