Reputation: 1
I am pre-processing data for Retrieval-Augmented Generation (RAG), using "MinerU" for PDF parsing.
I noticed that some equations lose their layout in the markdown result (for example, equations (1) and (3) in the .png). Five equations are expected, but only three are marked with a green background in the layout detection output, so the markdown result contains only three equations. This leads to incorrect answers during RAG.
I manually inspected this dataset and found results that were not as expected, but with a large volume of PDF files it's impractical to check each document by hand.
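To illustrate the kind of automated check I have in mind, here is a rough sketch that flags markdown files whose equation numbering has gaps. It rests on two assumptions about the output format that I have not confirmed: display equations are wrapped in `$$ ... $$`, and equation numbers survive as `\tag{n}` or as a trailing `(n)`. The function name `find_missing_equation_numbers` is just a placeholder of mine, not part of MinerU.

```python
import re

# Assumption: display equations appear as $$ ... $$ blocks in the markdown output.
DISPLAY_EQ = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Assumption: the equation number is kept as \tag{n} or as a trailing "(n)".
# The trailing "(n)" form is a heuristic and can misfire on e.g. "y = f(2)".
EQ_NUMBER = re.compile(r"\\tag\{(\d+)\}|\((\d+)\)\s*$")


def find_missing_equation_numbers(markdown_text):
    """Return the equation numbers skipped in the 1..max numbering sequence."""
    found = set()
    for block in DISPLAY_EQ.findall(markdown_text):
        match = EQ_NUMBER.search(block.strip())
        if match:
            # Exactly one of the two capture groups is set on a match.
            found.add(int(match.group(1) or match.group(2)))
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]
```

For a page where only equations (2), (4) and (5) made it into the markdown, this would return `[1, 3]`, so the file can be queued for review. It cannot detect a missing final equation or unnumbered equations, so it only narrows down which documents to inspect.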
So, how can I improve layout detection to address this issue? Are there alternative models recommended for layout analysis?
Upvotes: 0
Views: 9