Cristian Lehuede Lyon

Reputation: 1947

NoSQL for searching millions of pages?

I've been given approximately 4-5 million images of old documents that my company has decided to delete. We're trying to go paperless, but I'm facing a problem I don't fully understand. I've always used SQL for this amount of data, but now I only have images. I've already bought ABBYY FineReader OCR, and it is currently converting all the files to Word or PDF. The requirement is to search across this entire body of text in less than 7-10 seconds and return every result with a download link to the original image of the file.

I read about NoSQL, but it doesn't seem like the best approach to me: I'd have to create a table with no schema whatsoever and just add the entire text of each image along with the corresponding page number and a link to the original file. As far as I know, searching that would take ages. What other solutions can I use?

Upvotes: 1

Views: 321

Answers (1)

Didier Spezia

Reputation: 73226

To support searching over a set of documents, building a reverse index (also called an inverted index) is generally the best solution. Here I assume you want to support fast full-text search operations such as the ones provided by Google, Bing, etc., but on your own data.

Building a reverse index generally involves splitting the documents into words and adding them individually to the index. Each index entry has a word as its key, and as its value the document name (or some other identifier of the document) together with the locations of the word in that document.
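To make the structure described above concrete, here is a minimal in-memory sketch in Python. It is only an illustration under simplifying assumptions: the function names (`tokenize`, `build_index`, `search`) and the sample file names are hypothetical, the tokenizer just lowercases and keeps alphanumeric runs, and nothing is persisted to disk.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index

def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # .get() avoids creating empty entries for unknown words
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result

# Hypothetical OCR output, keyed by the original image file name
pages = {
    "scan_0001.png": "Invoice for office supplies",
    "scan_0002.png": "Office lease agreement",
}
index = build_index(pages)
matches = search(index, "office invoice")  # ids of pages containing both words
```

A lookup touches only the entries for the query words rather than scanning every page, which is what keeps search fast as the collection grows.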

You can do this manually, but parsing documents, extracting words, eliminating non-significant words (stop words), and indexing the rest is not trivial. It is easier to use a dedicated product.

Most RDBMSs support extensions that provide full-text indexing capabilities. For instance:
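As one hedged illustration of such an extension (chosen here for convenience, not necessarily one of the examples the answer had in mind), SQLite ships a full-text module called FTS5 that can be used from Python's standard `sqlite3` module. The table name `pages`, the column names, the sample paths, and the helper `find_pages` are all hypothetical, and the sketch assumes the SQLite library bundled with your Python was compiled with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumption: the bundled SQLite was built with the FTS5 extension.
# 'body' is full-text indexed; 'image_path' is stored but not indexed.
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5(body, image_path UNINDEXED)"
)
conn.executemany(
    "INSERT INTO pages (body, image_path) VALUES (?, ?)",
    [
        ("Invoice for office supplies", "/scans/0001.png"),
        ("Office lease agreement", "/scans/0002.png"),
    ],
)

def find_pages(conn, query):
    """Return the image paths of pages matching the full-text query."""
    rows = conn.execute(
        "SELECT image_path FROM pages WHERE pages MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]

hits = find_pages(conn, "office invoice")  # terms are ANDed by default
```

The same idea carries over to the full-text features of other database engines, although the syntax for creating the index and querying it differs from one product to another.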

Generally, these RDBMS extensions are less efficient than specialized engines. I would recommend one of the following products:

I think any of these products can index a few million documents.

Upvotes: 1
