How to separate title and headers from body text in image

I am using Tesseract (through the Python wrapper) to extract text from documents. These documents do not include any images or tables, only text.

Is there any option to distinguish the titles/headings from the body text? Ideally I want something like an XML tree rather than one long string (I do not need a visual of the document layout).

I found some third-party tools that seem to be able to help, but I was wondering if I can do it directly with Tesseract.

Upvotes: 11

Answers (3)

Snigam

Reputation: 11

You can use an ML-based solution, but for use cases like this I prefer lightweight solutions built on OpenCV's features. You can run regular text detection and pair it with morphological transformations to detect header text.
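A minimal sketch of what this could look like, not a definitive implementation. It assumes headings are set in a visibly larger font than body text, so that a line's bounding-box height is enough to tell them apart (bold headings at body size would be missed). The function names `find_line_boxes` and `label_headings`, the kernel width of 40, and the 1.3 height ratio are all illustrative choices that would need tuning per document.

```python
from statistics import median


def find_line_boxes(image_path, kernel_width=40):
    """Return (x, y, w, h) boxes for text lines, sorted top to bottom."""
    import cv2  # imported lazily so label_headings works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that text becomes white on black.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    # A wide, short kernel smears characters on the same line into one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_width, 3))
    dilated = cv2.dilate(binary, kernel, iterations=1)
    # [-2] picks the contour list in both OpenCV 3.x and 4.x.
    contours = cv2.findContours(
        dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )[-2]
    boxes = [cv2.boundingRect(c) for c in contours]
    return sorted(boxes, key=lambda b: b[1])


def label_headings(boxes, ratio=1.3):
    """Label a box 'heading' if it is clearly taller than the median line."""
    if not boxes:
        return []
    typical = median([h for _, _, _, h in boxes])
    return [
        ("heading" if h >= ratio * typical else "body", (x, y, w, h))
        for x, y, w, h in boxes
    ]
```

Each labelled box can then be cropped from the image and passed to `pytesseract.image_to_string` on its own, which gives the heading/body split needed to build a tree.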

Upvotes: 0

sohel shaikh

Reputation: 71

I am quite late to answer, but this answer might help others who are looking for a solution.

First, Tesseract alone won't be able to extract such "features" from the document. But all you need is a little understanding of ML and vision libraries (like Luminoth or Detectron2).

Basically, you have to provide some sample documents with mark-up (like title, header1, header2, etc.) and train the model. After training, you can use the model on unseen images to fetch those details.

Upvotes: 1

vencra

Reputation: 63

You can use the Nanonets OCR API to create your own model that separates headings from body text, or you can add different labels.

Upvotes: 1
