Indexing many incoming files

Question

I am trying to write a full-text search application that indexes nearly 10,000 incoming files every 5 minutes. Before anyone suggests Lucene, Solr, Sphinx, ElasticSearch, etc.: I am not allowed to use any of these. So I am reading up on how to build an index myself. Specifically, I am restricted to using MySQL (or any other RDBMS) to store the index (not the files).

From the very little I understand about Lucene, at its core it runs on an inverted index. I am trying to replicate that by creating a database table of words and the files that contain them. (Again, I can't use documents the way Lucene does.)
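To make the idea concrete, here is a minimal in-memory sketch of the word-to-files mapping I have in mind. The class and method names (`InvertedIndexSketch`, `addFile`, `filesContaining`) and the simple tokenizer are my own assumptions for illustration, not anything taken from Lucene. In MySQL, each (word, file id) pair would presumably become one row.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index: word -> ids of the files containing it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenize on anything that is not a lowercase letter or digit (an assumed,
    // very naive tokenizer) and record the file id under each word.
    void addFile(int fileId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(fileId);
        }
    }

    // Returns the sorted ids of the files containing the word, or an empty set.
    SortedSet<Integer> filesContaining(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addFile(1, "The quick brown fox");
        index.addFile(2, "The lazy dog");
        System.out.println(index.filesContaining("the"));
        System.out.println(index.filesContaining("fox"));
    }
}
```

Keeping the file ids sorted per word is what would let two lists be intersected cheaply for multi-word queries.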

I am running a cron job that checks every 5 minutes whether new files have been uploaded and puts them onto a queue. A Java program consumes the queue, builds the index, and stores it in a MySQL table. Processing the queue first come, first served (FCFS) is fine when we are dealing with a few files, but with a load of 10,000 files coming in every 5 minutes, indexing will take a lot of time. So is it optimal to spawn a thread every time new files are pushed? That would result in several thousand threads running on my server, which is already performing other tasks. What would be the best way to handle this?
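For comparison with the thread-per-push approach, one alternative I can think of is a fixed-size worker pool from `java.util.concurrent`, so the number of threads stays bounded no matter how many files arrive. This is only a sketch under my own assumptions: `BoundedIndexerSketch`, `indexFile`, and the worker count are hypothetical, and `indexFile` merely counts files instead of parsing them and writing to MySQL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: index a batch of files with a bounded number of threads.
class BoundedIndexerSketch {

    // Stand-in for "parse the file and write its postings to MySQL" (assumed).
    static void indexFile(String path, AtomicInteger indexed) {
        indexed.incrementAndGet();
    }

    // Runs the whole batch on at most `workers` threads and returns how many
    // files were processed before the pool terminated.
    static int indexBatch(List<String> paths, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger indexed = new AtomicInteger();
        for (String path : paths) {
            // Tasks beyond `workers` wait in the pool's internal queue.
            pool.execute(() -> indexFile(path, indexed));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(5, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // batch overran the 5-minute window
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            paths.add("file-" + i + ".txt");
        }
        System.out.println(indexBatch(paths, 8) + " files indexed");
    }
}
```

Whether 8 workers (or any fixed number) can keep up with the load would depend on how long indexing one file takes and on how MySQL handles the concurrent writes, which I have not measured.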

Another question: from what I have read, Lucene uses skip lists to store the list of documents containing each word. Something like this: http://4.bp.blogspot.com/-aAvEQEILnEc/USeg8wgdBqI/AAAAAAAAA-s/1D9sNkwVwkk/s1600/p1.png

However, because I am using MySQL, I cannot use a skip list; instead I have to denormalize and accept a lot of redundancy. Is there any way around that?

