Reputation: 43
My challenge is to scan pages from 100-year-old magazines and make them accessible to a wider audience. I have to extract the text (a typical OCR task), separate out the images, and save them in separate files.
I am using C# and Tesseract to extract the text - that's pretty easy and works (almost) perfectly. I'll leave that for the time being...
Now I want to automatically extract the images - and am struggling.
Currently I am using C# and AForge.NET to apply filters and operators, but none of them has produced any tangible results.
I figure one of the challenges I face is that the pictures use dithering (or a similar halftone technique) to emulate gray scales, as was common at the beginning of the 1900s. Hence converting the scanned image to gray scale only reduces the pixel format (from the 32-bit scanned JPEG to 8-bit grays). I tried other things, like enhancing contrast and/or brightness and binarization to separate foreground from background, but the results were all the same: gazillions of individual pixels... but not a single segment that can be extracted as an image.
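One approach worth trying (sketched below with AForge.NET, which the question already uses) is to blur the page heavily first, so the individual halftone dots merge into solid gray patches, then threshold and run blob detection, keeping only blobs large enough to be pictures rather than letters. This is only a sketch: the blur strength, threshold value, and minimum blob size are guesses that would need tuning against real scans.

```csharp
using System.Drawing;
using AForge.Imaging;
using AForge.Imaging.Filters;

class HalftonePictureFinder
{
    // Sketch: locate picture regions on a scanned halftoned page.
    // All numeric parameters are assumptions to tune per scan resolution.
    static Rectangle[] FindPictureRegions(Bitmap page)
    {
        // 1. Reduce to 8-bit gray scale.
        Bitmap gray = Grayscale.CommonAlgorithms.BT709.Apply(page);

        // 2. Heavy blur: merges the isolated halftone dots into solid
        //    gray patches, so a picture becomes one connected dark
        //    region instead of gazillions of individual pixels.
        new GaussianBlur(4.0, 11).ApplyInPlace(gray);

        // 3. Binarize: picture areas stay mostly dark, text thins out.
        new Threshold(160).ApplyInPlace(gray);

        // 4. AForge's blob counter treats bright pixels as objects,
        //    so invert: pictures become white blobs on black.
        new Invert().ApplyInPlace(gray);

        // 5. Keep only blobs large enough to be pictures, not letters.
        var blobCounter = new BlobCounter
        {
            FilterBlobs = true,
            MinWidth = 100,   // tune to the scan's DPI
            MinHeight = 100
        };
        blobCounter.ProcessImage(gray);
        return blobCounter.GetObjectsRectangles();
    }
}
```

Each returned rectangle can then be cropped from the original, unblurred scan with `Bitmap.Clone(rect, page.PixelFormat)`, so the saved pictures keep full detail. If the blur leaves a picture fragmented into several blobs, a morphological `Closing` filter between steps 3 and 4 may help fuse them.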
Here's an example of such a page:
I hope someone has some clever hints for me.
Upvotes: 1
Views: 45