Jesvin Jose

Reputation: 23098

Best-suited text indexer for handling 10000s of (formatted) documents in python

I want to add a feature to search documents stored in a directory. The back end is developed in Python so that the search results can be further manipulated. The documents are stored on a dedicated web server.

The established technologies (Lucene, Xapian, Whoosh) have mature Python bindings. My colleagues have set up Apache, Lucene and PHP for their clients. I would choose Whoosh because it is written in Python, but reviews of its slow performance and lack of "feature X" give me pause.

My specific requirements are:

Support (makes me bite my nails)

Features (I am a newb here)

Upvotes: 0

Views: 184

Answers (2)

Toofan

Reputation: 150

Solr, even though it is written in Java, is an amazingly powerful search engine.

It has everything you need: highlighting, weighting, the ability to insert new items into the index relatively quickly, and a whole slew of other features, such as autocomplete-style suggestions.

It offers JSON / XML / other response formats, and there are fairly good ways to talk to the search engine from Python.
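To illustrate the point about JSON responses: since Solr exposes an HTTP query API, talking to it from Python needs nothing beyond the standard library (or `requests`/`pysolr` if you prefer). A minimal sketch of building a `/select` query with highlighting enabled, assuming a Solr instance at `localhost:8983` with a hypothetical core named `docs` and a `content` field:

```python
from urllib.parse import urlencode

def build_solr_query(base_url, text, field="content", rows=10):
    """Build a Solr /select URL asking for JSON output with highlighting.

    The core URL and field name are placeholders -- adjust to your schema.
    """
    params = {
        "q": f"{field}:{text}",  # query against one field
        "wt": "json",            # request a JSON response
        "rows": rows,            # number of hits to return
        "hl": "on",              # enable hit highlighting
        "hl.fl": field,          # field(s) to highlight
    }
    return f"{base_url}/select?{urlencode(params)}"

url = build_solr_query("http://localhost:8983/solr/docs", "python")
print(url)
```

The resulting URL can then be fetched with `urllib.request.urlopen(url)` and the body parsed with `json.loads`; the search results and highlighting snippets come back as plain JSON, ready for further manipulation in Python.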

Upvotes: 1

禪 師 無

Reputation: 422

Sphinx is pretty easy to interact with because it works via a MySQL storage engine, an interface most programmers have touched at one point or another. Doubly so if you already have data in MySQL, because then you can munge the data together trivially. Django-sphinx is an example of a fairly mature and easy-to-use means of interacting with Sphinx.
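Because Sphinx speaks the MySQL wire protocol (SphinxQL, by default on port 9306), any MySQL client library can query it. One detail worth sketching is escaping: several characters are operators in Sphinx's MATCH() syntax and must be backslash-escaped before being interpolated into a full-text query. A minimal sketch, using a simplified (not exhaustive) list of special characters and a hypothetical index name `docs_index`:

```python
def escape_match(text):
    """Escape characters that act as operators in Sphinx MATCH() syntax.

    This list is a simplified subset of the documented extended-syntax
    operators; consult the Sphinx docs for the complete set.
    """
    specials = '\\()|-!@~"&/^$='
    out = []
    for ch in text:
        if ch in specials:
            out.append("\\" + ch)  # prefix operator chars with a backslash
        else:
            out.append(ch)
    return "".join(out)

# The escaped term would then be bound into a parameterised SphinxQL
# query via any MySQL client, e.g. pymysql connected to port 9306:
#   cursor.execute("SELECT id FROM docs_index WHERE MATCH(%s) LIMIT 10",
#                  (escape_match(user_input),))
print(escape_match('full-text "search"'))
```

Using parameter binding for the query and escaping only inside MATCH() keeps user input from being interpreted as query operators.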

I know it's performant because I've used it in high-load, high-traffic situations and it has done very well. It supports all the semantics/features I've ever found myself needing.

Lucene can be made more tolerable with Solr, which provides a REST interface to Lucene. The native bindings can be a bit arcane/alien to people not used to interacting with a search engine.

Upvotes: 1
