Reputation: 2247
I was parsing some PDF files with LlamaParse in Python, using the code below:
import os
import pandas as pd
import nest_asyncio
from llama_parse import LlamaParse

# let LlamaParse's async calls run inside an already running event loop (e.g. a notebook)
nest_asyncio.apply()

key_input = "some_key_id"
os.environ["LLAMA_CLOUD_API_KEY"] = key_input

# run the LlamaParse parsing
doc_parsed = LlamaParse(result_type="markdown", api_key=key_input).load_data(r"Path\myfile.pdf")
Parsing the same document with the same code now gives different results than it did a month ago. The difference is in the | separators and the line breaks used for tabular text.
Is there a way to get the old results back from LlamaParse, or to pin some parameters (for example the model) so that it always produces the same output? I want to build analytics on top of this output using the same code logic.
Last month's LlamaParse results:
print(doc_parsed[5].text[:1000])
# Information
|Name|: Mr. XXX|
|---|---|
|Age/Sex|: XX YRS/M|
|Lab Id.|: 0124080X|
|Refered By|: Self|
|Sample Collection On|: 03/Aug/2024 08:30AM|
|Collected By|: XXX|
|Sample Lab Rec. On|: 03/Aug/2024 11:50 AM|
|Collection Mode|: HOME COLLECTION|
|Reporting On|: 03/Aug/2024 02:48 PM|
|BarCode|: XXX|
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
LlamaParse results on the same PDF now:
print(doc_parsed[5].text[:1000])
# Report
Name: Mr. XXX
Age/Sex: XXX YRS/M
Lab Id: 0124080X
Referred By: Self
Sample Collection On: 03/Aug/2024 08:30 AM
Collected By: XXX
Sample Lab Rec. On: 03/Aug/2024 11:50 AM
Collection Mode: HOME COLLECTION
Reporting On: 03/Aug/2024 02:48 PM
BarCode: XXX
# Test Results
Test Name
Result
Biological Ref. Int.
Unit
Desired Results:
# The part above doesn't matter, but Test Results should be separated by |
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
Is a change of model on the backend causing the difference? Can I pin the model to get consistent results?
Upvotes: 1
Views: 368
Reputation: 155
I know it has been a while, but here is what has worked for me. I am assuming that you are doing this for RAG, not for the sake of getting the parsing exactly right. Instead of parsing the PDF exactly, I do a rough parse that picks up all the text from the table, and I replace the table itself with some metadata like {table id: 'werwrwe', summary: "contains the values of collection date, barcode etc", page: 12}. During retrieval, if the similarity search selects this chunk, I use the table id and the page to go back to the original document and retrieve that page as a picture, which gives me the raw table as an image. This has been a much better RAG process for me than trying to get the parsing absolutely correct.
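The placeholder idea above could be sketched roughly like this. This is only an illustration in plain Python, not part of LlamaParse: the function name, the `registry` dict, and the use of the first table row as a stand-in for the summary are all assumptions made for the example, and it only handles tables that already arrive as pipe-delimited lines.

```python
import json
import re
import uuid

# Assumption: a "table" is a run of two or more consecutive lines that
# start and end with "|", like the markdown tables shown in the question.
TABLE_RE = re.compile(r"(?:^\|.*\|[ \t]*\n?){2,}", re.MULTILINE)


def replace_tables_with_placeholders(page_text, page_number, registry):
    """Swap each pipe table in page_text for a small JSON placeholder.

    registry (a plain dict, hypothetical) maps table_id -> page number and
    raw table text, so a retriever can later go back to the original page.
    """

    def _swap(match):
        raw = match.group(0)
        table_id = uuid.uuid4().hex[:8]
        # Use the first row's cells as a crude description of the table.
        first_row = raw.splitlines()[0].strip()
        columns = [cell.strip() for cell in first_row.strip("|").split("|")]
        registry[table_id] = {"page": page_number, "raw": raw}
        placeholder = json.dumps(
            {"table_id": table_id, "columns": columns, "page": page_number}
        )
        # Keep the trailing newline only if the table had one.
        return placeholder + ("\n" if raw.endswith("\n") else "")

    return TABLE_RE.sub(_swap, page_text)
```

At retrieval time, a chunk containing such a placeholder can be looked up in `registry` by its `table_id` to find the page to render as an image. Rendering the page itself is left out here because it depends on the PDF library in use.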
Upvotes: 0
Reputation: 2247
Adding parsing_instruction and asking it to create tabular data separated by | has helped produce the desired results, but I am not sure whether output driven by instructions will stay consistent over time.
Other answers are welcome; I am open to better ones.
# parsing instruction
parsingInstruction2 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the table data into four columns each separated by |."""
# parse function
doc_parsed_13Sep2 = LlamaParse(
    result_type="markdown",
    api_key=key_input,
    parsing_instruction=parsingInstruction2,
).load_data(r"Path\myfile.pdf")
output:
# Report
table {
width: 100%;
border-collapse: collapse;
}
th, td {
border: 1px solid black;
padding: 8px;
text-align: left;
}
th {
background-color: #f2f2f2;
}
Name: Mr. XXX
Age/Sex: XXX YRS/M
Lab Id: 0124080X
Referred By: Self
Sample Collection On: 03/Aug/2024 08:30AM
Collected By: XXX
Sample Lab Rec. On: 03/Aug/2024 11:50 AM
Collection Mode: HOME COLLECTION
Reporting On: 03/Aug/2024 03:24 PM
BarCode: XXX
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dl|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
UPDATE: I changed the instructions to ask for fields separated by |:
parsingInstruction3 = """The provided document is a Report.
It should contain tables.
Try to reconstruct the data with fields separated by |."""
output:
# TEST REPORT
|Name|Mr. XXX|
|---|---|
|Age/Sex|XXX YRS/M|
|Lab Id.|0124080X|
|Referred By|Self|
|Sample Collection On|03/Aug/2024 08:30 AM|
|Collected By|XXX|
|Sample Lab Rec. On|03/Aug/2024 11:50 AM|
|Collection Mode|HOME COLLECTION|
|Reporting On|03/Aug/2024 03:24 PM|
|BarCode|XXX|
# Test Results
|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|BLOOD UREA|31.80|12-43|mg/dL|
|BLOOD UREA NITROGEN (BUN)|15|6 - 21|mg/dL|
|SERUM CREATININE|1.10|0.9 - 1.3|mg/dL|
|SERUM URIC ACID|5.8|3.5-7.2|mg/dL|
|UREA / CREATININE RATIO|28.91|23 - 33|Ratio|
|BUN / CREATININE RATIO|13.51|5.5 - 19.2|Ratio|
|INORGANIC PHOSPHORUS|3.63|2.5-4.5|mg/dL|
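Since the open worry is whether the format stays stable over time, one option is to check the parsed text before running any analytics on it. The sketch below is plain Python and does not touch LlamaParse; `parse_markdown_tables` and `require_table` are hypothetical helper names, and it assumes the tables arrive as pipe-delimited lines with a `|---|` separator row, as in the output above.

```python
def _is_table_line(line):
    s = line.strip()
    return len(s) > 1 and s.startswith("|") and s.endswith("|")


def _cells(line):
    # Drop the outer pipes, then split the rest into trimmed cells.
    return [cell.strip() for cell in line.strip()[1:-1].split("|")]


def _is_separator(cells):
    # A separator row looks like |---|---| (optionally with ":" for alignment).
    return all(c and set(c) <= set("-:") for c in cells)


def parse_markdown_tables(text):
    """Return every pipe table in text as {"header": [...], "rows": [dict, ...]}."""
    tables, block = [], []
    for line in text.splitlines() + [""]:  # the empty sentinel flushes the last block
        if _is_table_line(line):
            block.append(_cells(line))
            continue
        if len(block) >= 2 and _is_separator(block[1]):
            header = block[0]
            rows = [dict(zip(header, r)) for r in block[2:] if len(r) == len(header)]
            tables.append({"header": header, "rows": rows})
        block = []
    return tables


def require_table(text, expected_header):
    """Return the rows of the table with expected_header, or fail loudly."""
    for table in parse_markdown_tables(text):
        if table["header"] == expected_header:
            return table["rows"]
    raise ValueError(f"no table with header {expected_header} found")
```

Calling `require_table(doc_parsed[5].text, ["Test Name", "Result", "Biological Ref. Int.", "Unit"])` would then return one dict per test row, and raise a `ValueError` if the service goes back to emitting one cell per line, so a format change is caught instead of silently breaking the analytics.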
Upvotes: 0