Reputation: 2542
I have a large number of documents (mainly PDFs) that I want to index and query on.
I want to store all these docs in a filesystem structure by year.
I currently have this set up in Solr, but I have to run scripts to extract metadata from the PDFs and then update the index.
Is there a product out there that basically lets me drop a new PDF into a folder and have it automatically indexed by Solr?
I have seen that Alfresco does this, but it has some drawbacks. Is there anything else along these lines?
Or should I use Nutch to crawl my filesystem and post updates to Solr? I'm not sure how I should do this.
Upvotes: 1
Views: 402
Reputation: 1768
Solr is a search server, not a crawler. As you noted, Nutch can do this (I have used it for a similar use case: indexing a knowledge-base dump).
Essentially, you would host a webserver with the root of the folder structure as the document root, and enable directory listing on it. Nutch can then crawl the top-level URL of the document dump and follow the listing links to every file.
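For the webserver step, a minimal Apache httpd sketch might look like the following. The `/data/docs` path and the `Order`/`Allow` directives (Apache 2.2 style) are assumptions for illustration, not from the original post:

```
# Serve the document dump and allow automatic directory listings,
# so a crawler can discover every file by following the listing links.
DocumentRoot "/data/docs"
<Directory "/data/docs">
    # +Indexes makes Apache generate a listing for folders without an index page
    Options +Indexes
    Order allow,deny
    Allow from all
</Directory>
```

Any webserver with directory listing (e.g. nginx's `autoindex on;`) would serve the same purpose.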
Once Nutch has built this index, you can expose it through Solr as well.
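The crawl-and-hand-off step can be sketched roughly like this for Nutch 1.x. The seed URL, crawl depth, and Solr URL are assumed examples, and the exact command syntax varies between Nutch versions, so check the docs for yours:

```
# Seed list: the top-level URL of the document dump served above
mkdir -p urls
echo "http://localhost/docs/" > urls/seed.txt

# Crawl the directory listings and post the parsed documents to Solr
# (Nutch 1.x one-shot crawl command; depth/topN values are examples)
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 1000
```

Nutch's parsing plugins (Tika) handle the PDF text extraction during the crawl, so you no longer need separate metadata-extraction scripts. Re-running the crawl on a schedule (e.g. from cron) picks up newly dropped files.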
Upvotes: 2