Jesvin Jose

Reputation: 23098

Best-suited text indexer for handling 10000s of (formatted) documents in python

I want to add a feature to search documents stored in a directory. The back end is developed in Python so that the search results can be further manipulated. The documents are stored on a dedicated web server.

The established technologies (Lucene, Xapian, Whoosh) have mature Python bindings. My colleagues have set up Apache, Lucene and PHP for their clients. I would choose Whoosh because it is written in Python, but reviews of its slow performance and lack of "feature X" give me pause.

My specific requirements are:

Support (makes me bite my nails)

Features (I am a newb here)

Upvotes: 0

Views: 184

Answers (2)

Toofan

Reputation: 150

Solr, even though it is written in Java, is an amazingly powerful search engine.

It has everything you need: highlighting, weighting, the ability to insert new items into the index relatively quickly, and a whole slew of other features, such as autocomplete-style suggestions.

It offers JSON / XML / other response formats, and there are fairly good ways to talk to the search engine from Python.
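To illustrate the point about JSON responses: since Solr exposes an HTTP query API, talking to it from Python needs nothing beyond the standard library (or `requests`/`pysolr` if you prefer). A minimal sketch of building a `/select` query with highlighting enabled, assuming a Solr instance at `localhost:8983` with a hypothetical core named `docs` and a `content` field:

```python
from urllib.parse import urlencode

def build_solr_query(base_url, text, field="content", rows=10):
    """Build a Solr /select URL asking for JSON output with highlighting.

    The core URL and field name are placeholders -- adjust to your schema.
    """
    params = {
        "q": f"{field}:{text}",  # query against one field
        "wt": "json",            # request a JSON response
        "rows": rows,            # number of hits to return
        "hl": "on",              # enable hit highlighting
        "hl.fl": field,          # field(s) to highlight
    }
    return f"{base_url}/select?{urlencode(params)}"

url = build_solr_query("http://localhost:8983/solr/docs", "python")
print(url)
```

The resulting URL can then be fetched with `urllib.request.urlopen(url)` and the body parsed with `json.loads`; the search results and highlighting snippets come back as plain JSON, ready for further manipulation in Python.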

Upvotes: 1

禪 師 無

Reputation: 422

Sphinx is pretty easy to interact with because it works via a MySQL storage engine, an interface most programmers have touched at one point or another. Doubly so if you already have data in MySQL, because then you can munge the data together trivially. Django-sphinx is an example of a fairly mature and easy-to-use means of interacting with Sphinx.
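Because Sphinx speaks the MySQL wire protocol (SphinxQL, by default on port 9306), any MySQL client library can query it. One detail worth sketching is escaping: several characters are operators in Sphinx's MATCH() syntax and must be backslash-escaped before being interpolated into a full-text query. A minimal sketch, using a simplified (not exhaustive) list of special characters and a hypothetical index name `docs_index`:

```python
def escape_match(text):
    """Escape characters that act as operators in Sphinx MATCH() syntax.

    This list is a simplified subset of the documented extended-syntax
    operators; consult the Sphinx docs for the complete set.
    """
    specials = '\\()|-!@~"&/^$='
    out = []
    for ch in text:
        if ch in specials:
            out.append("\\" + ch)  # prefix operator chars with a backslash
        else:
            out.append(ch)
    return "".join(out)

# The escaped term would then be bound into a parameterised SphinxQL
# query via any MySQL client, e.g. pymysql connected to port 9306:
#   cursor.execute("SELECT id FROM docs_index WHERE MATCH(%s) LIMIT 10",
#                  (escape_match(user_input),))
print(escape_match('full-text "search"'))
```

Using parameter binding for the query and escaping only inside MATCH() keeps user input from being interpreted as query operators.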

I know it's performant because I've used it in high-load, high-traffic situations and it has done very well. It supports all the semantics/features I've ever found myself needing.

Lucene can be made more tolerable with Solr, which provides a REST interface to Lucene. The native bindings can be a bit arcane/alien to people not used to interacting with a search engine.

Upvotes: 1
