Reputation: 958
I am using Tesseract (through the Python wrapper) to extract text from documents. These documents do not include any images or tables, only text.
Is there an option to distinguish the titles/headings from the body text? Ideally I would like to get something like an XML tree rather than one flat string (I do not need a visual representation of the document layout).
I found some third-party tools that seem to be able to help, but I was wondering whether I can do it directly with Tesseract.
Upvotes: 11
Views: 4748
Reputation: 11
You can use an ML-based solution, but for use cases like this I prefer lightweight solutions built on OpenCV's features. You can run regular text detection and pair it with morphological transformations to detect header text.
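As a rough sketch of that idea (not necessarily the exact approach meant above): dilate the binarized page with a wide, flat kernel so that each text line becomes one blob, then flag lines that are noticeably taller than the typical line. The function names, the `kernel_width` value, and the `ratio` threshold are assumptions chosen for illustration, and the height rule is only a heuristic that assumes headings are set in a larger font.

```python
from statistics import median


def find_line_boxes(image_path, kernel_width=40):
    """Return (x, y, w, h) boxes of text lines, top to bottom.

    Requires opencv-python. kernel_width is a guess and needs tuning
    to the scan resolution.
    """
    import cv2  # imported lazily so flag_headings works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that text is white on a black background.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU
    )
    # A wide, one-pixel-high kernel merges the characters of a line
    # horizontally without joining neighbouring lines vertically.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width, 1))
    dilated = cv2.dilate(binary, kernel, iterations=1)
    # [-2] picks the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(
        dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[-2]
    boxes = [cv2.boundingRect(c) for c in contours]
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def flag_headings(boxes, ratio=1.3):
    """Pair each box with True if it is taller than ratio * median height."""
    if not boxes:
        return []
    typical = median(h for _, _, _, h in boxes)
    return [(box, box[3] > ratio * typical) for box in boxes]
```

Each flagged box can then be cropped and passed to Tesseract separately from the body text. Bold headings set in the same font size as the body would not be caught by this height rule.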
Upvotes: 0
Reputation: 71
I am quite late to answer, but this answer might help others who are looking for a solution.
First, Tesseract alone won't be able to extract such "features" from the document. But all you need is a little understanding of ML and of vision libraries (like Luminoth or Detectron2).
Basically, you have to provide some sample documents with mark-up (title, header1, header2, etc.) and train a model on them. After training, you can run the model on unseen images to fetch those details.
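Once such a model returns labeled regions in reading order, turning them into the tree the question asks for is mostly bookkeeping. Here is a minimal sketch using only the standard library; the `(label, text)` input format, the `build_tree` name, and the `section`/`p` element names are assumptions for illustration, not the output format of any particular library.

```python
import xml.etree.ElementTree as ET

# Nesting depth per heading label; label names follow the mark-up
# mentioned above (title, header1, header2).
LEVELS = {"title": 0, "header1": 1, "header2": 2}


def build_tree(regions):
    """Nest (label, text) pairs, given in reading order, into an XML tree."""
    root = ET.Element("document")
    stack = [(-1, root)]  # (level, element); the root is never popped
    for label, text in regions:
        if label in LEVELS:
            level = LEVELS[label]
            # Close every open section at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(
                stack[-1][1], "section", {"label": label, "heading": text}
            )
            stack.append((level, section))
        else:
            # Anything that is not a heading becomes a paragraph
            # of the innermost open section.
            ET.SubElement(stack[-1][1], "p").text = text
    return root
```

Calling `ET.tostring(build_tree(regions), encoding="unicode")` serializes the result as an XML string.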
Upvotes: 1