陳俊方

Reputation: 1

PDF parsing with "MinerU" loses equations during layout analysis, leading to incorrect RAG answers

I am using "MinerU" to parse PDFs as the data pre-processing step of a Retrieval-Augmented Generation (RAG) pipeline.

I noticed that layout detection misses some equations, so they are absent from the Markdown result (for example, equations (1) and (3) in the .png). Five equations are expected, but only three are marked with a green background, so the Markdown result contains only three equations. This leads to incorrect answers during RAG.

I manually inspected this dataset and found that the results may not be as expected. With a large volume of PDF files, however, it is impractical to check each document manually.
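To illustrate the kind of automated check I have in mind, here is a minimal sketch that flags documents whose equation numbering has gaps. It is not part of MinerU. It assumes that display equations appear in the Markdown output as `$$ ... $$` blocks and that each numbered equation carries its number as `\tag{n}` or a trailing `(n)`; both are assumptions about the output format, and the function names are hypothetical.

```python
import re
from typing import List

# Assumption: display equations are emitted as $$ ... $$ blocks.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Assumption: the equation number is written as \tag{n} or as a trailing (n).
EQ_NUMBER = re.compile(r"\\tag\{(\d+)\}|\((\d+)\)\s*$")


def equation_numbers(markdown: str) -> List[int]:
    """Return the equation numbers found in display-math blocks, in order."""
    numbers = []
    for match in DISPLAY_MATH.finditer(markdown):
        body = match.group(1).strip()
        found = EQ_NUMBER.search(body)
        if found:
            numbers.append(int(found.group(1) or found.group(2)))
    return numbers


def missing_equation_numbers(markdown: str) -> List[int]:
    """Return the numbers skipped between 1 and the highest number seen."""
    seen = set(equation_numbers(markdown))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

A document for which `missing_equation_numbers` returns a non-empty list could be queued for manual review. This heuristic cannot detect a lost final equation or unnumbered equations, so it only narrows down the set of documents to check.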

How can I improve layout detection to address this issue? Are there alternative models recommended for layout analysis?

Upvotes: 0

Views: 9

Answers (0)
