PDF parsing with "MinerU": layout analysis may lose equations, leading to incorrect RAG answers

Question

I am using "MinerU" for PDF parsing in the data pre-processing stage of a Retrieval-Augmented Generation (RAG) pipeline.

I noticed that some equations are missed by layout detection and are lost in the Markdown result (for example, equations (1) and (3) in the .png). Five equations are expected, but only three are marked with a green background, so the Markdown result contains only three equations. This leads to incorrect answers during RAG.

I inspected this dataset manually and found that the results are not always as expected, but with a large volume of PDF files it is impractical to check each document by hand.
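To avoid checking every file by hand, a rough automated screen over the Markdown output might help. The following is only a minimal sketch under two assumptions that may not hold for every document: that display equations appear in the Markdown as `$$ ... $$` blocks, and that the prose cites equations in a form such as "Eq. (5)" or "Equation 5". The function names are hypothetical and are not part of MinerU. A file is flagged when the text cites an equation number higher than the number of equation blocks actually found.

```python
import re

# Assumption: display equations are emitted as $$ ... $$ blocks in the Markdown.
DISPLAY_EQ = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)

# Assumption: the prose cites equations as "Eq. (3)", "Eqs. (3)", "Equation 3", etc.
EQ_REF = re.compile(
    r"\b(?:Eq|Eqs|Equation|Equations)\.?\s*\(?(\d+)\)?", re.IGNORECASE
)


def count_display_equations(markdown: str) -> int:
    """Count the $$ ... $$ blocks found in the parsed Markdown."""
    return len(DISPLAY_EQ.findall(markdown))


def highest_referenced_equation(markdown: str) -> int:
    """Return the largest equation number cited in the text, or 0 if none."""
    numbers = [int(n) for n in EQ_REF.findall(markdown)]
    return max(numbers, default=0)


def needs_review(markdown: str) -> bool:
    """Flag a document whose text cites more equations than were extracted."""
    return count_display_equations(markdown) < highest_referenced_equation(markdown)
```

This only narrows down which documents to look at manually. It cannot detect a lost equation that the text never cites by number, and it does not fix the layout detection itself.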

How can I improve layout detection to address this issue? Are there alternative models recommended for layout analysis?

Answers (0)