Reputation: 6091
I have a large number of folders, each containing many image files. Occasionally a scanned document image ends up in a folder by accident and, short of someone visually checking the folder, it remains undetected, which could cause problems if it is published to the wrong location.
Since they could have been scanned as any file type and sizes are broadly in the range of the genuine images, they are very hard to detect from metadata.
Does anyone know of a way to distinguish a scanned document from a genuine image - either a tool or a programmatic way?
Upvotes: 0
Views: 2367
Reputation: 1642
Can there be other text-on-background images in the folders? Are large pictures common in these scanned documents? One non-foolproof way of filtering mostly-text documents out of a haystack of more complex images would be to threshold the images on their Shannon (histogram) entropy and flag anything that falls below the threshold. Most photographic images have markedly higher entropy values than simple documents, which are dominated by a few grey levels.
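A minimal sketch of that entropy test, in plain Python. The function names and the 3.0-bit threshold are my own assumptions for illustration, not calibrated values; you would need to tune the threshold on a sample of your own folders. It works on any iterable of 8-bit grey values, so how you load the pixels is up to you.

```python
import math
from collections import Counter


def histogram_entropy(pixels):
    """Shannon entropy (in bits) of the grey-level histogram.

    `pixels` is any iterable of integer grey values (e.g. 0..255).
    """
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over grey levels of p * log2(1/p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def is_probable_document(pixels, threshold_bits=3.0):
    """Flag low-entropy images as probable documents.

    threshold_bits is a hypothetical starting point, not a tuned value.
    """
    return histogram_entropy(pixels) < threshold_bits


# Synthetic examples: a mostly white page with some black ink,
# and an image that uses every grey level equally often.
page = [255] * 900 + [0] * 100
flat_histogram = list(range(256)) * 4

print(is_probable_document(page))            # low entropy
print(is_probable_document(flat_histogram))  # maximal entropy for 8-bit data
```

With Pillow, for instance, something like `Image.open(path).convert("L").tobytes()` should give you a suitable sequence of grey values, but treat that as one option rather than a requirement.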
Upvotes: 0
Reputation: 28950
Assuming that scanned documents look like documents, any image processing library should do. You simply have to pick a few features that separate documents from everything else, then apply some basic classification or machine learning using these features.
The few remaining files can be checked either by a human or using OCR. I would not run OCR on all files, as it would take more computation time than a simple classification.
Documents (especially the confidential ones) tend to have a bright background with a high-frequency dark foreground. The dark content is grouped in lines. There is little to no colour, and where colour does appear it usually covers only a small fraction of the document (logos and such). I can't think of many images that share those properties.
So unless you have a lot of pictures of newspapers and books in your collection, you should be fine.
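To make those properties concrete, here is a rough sketch that measures three of them: the fraction of bright pixels, the fraction of noticeably coloured pixels, and the number of horizontal "ink bands" separated by blank rows. Everything in it (the function name, the cut-off values 200, 100 and 40, and the representation of an image as a list of rows of `(r, g, b)` tuples) is an assumption chosen for illustration. The resulting numbers could be fed into whatever classifier you prefer.

```python
def document_features(rows, bright=200, dark=100, chroma=40):
    """Compute simple document-likeness features.

    rows: list of rows, each a list of (r, g, b) tuples with 0..255 values.
    bright, dark, chroma: hypothetical cut-offs, to be tuned on real data.
    """
    total = bright_count = coloured_count = 0
    row_has_ink = []
    for row in rows:
        ink = False
        for r, g, b in row:
            total += 1
            luminance = (r + g + b) / 3
            if luminance >= bright:
                bright_count += 1
            if luminance <= dark:
                ink = True
            # A large spread between channels means the pixel is not grey.
            if max(r, g, b) - min(r, g, b) > chroma:
                coloured_count += 1
        row_has_ink.append(ink)

    # Count runs of consecutive inked rows: each run starts where a row
    # has ink and the row above it does not.
    bands = sum(
        1
        for prev, cur in zip([False] + row_has_ink, row_has_ink)
        if cur and not prev
    )
    if total == 0:
        return {"bright_fraction": 0.0, "coloured_fraction": 0.0, "ink_bands": 0}
    return {
        "bright_fraction": bright_count / total,
        "coloured_fraction": coloured_count / total,
        "ink_bands": bands,
    }


# Synthetic "page": white background with three two-row bands of black marks.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
page = [[WHITE] * 10 for _ in range(20)]
for y in (2, 3, 6, 7, 10, 11):
    for x in range(1, 9, 2):
        page[y][x] = BLACK

print(document_features(page))
```

A page-like image should score a high bright fraction, a near-zero coloured fraction and several ink bands, whereas a typical photo tends to fail at least one of those checks.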
Of course, scanners and cameras have different imaging properties and optical aberrations, and I'm sure you can find traces of them in the files, but that won't work for all images, especially not if those images were cropped from bigger ones.
Upvotes: 1
Reputation: 664
I would recommend taking a look at the Accord Framework: http://accord-framework.net/. Check out the Computer Vision features. I think it should be up to the task you are describing, plus it is a fun new area to learn. Good luck.
Upvotes: 4