Reputation: 1695
I want to do general substring search across billions of strings. The requirement differs a little from ordinary full-text search, because I want a query such as "ubst" to also hit "substr".
Can Lucene or Sphinx do this? If not, what do you think is the best way to do it?
Upvotes: 8
Views: 3165
Reputation: 12573
SQLite has a full-text search extension called FTS5. It looks stable, and it is free.
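One caveat worth illustrating: with its default tokenizer, FTS5 matches whole tokens and token prefixes, so "ubst" would not hit "substr". The sketch below assumes the trigram tokenizer instead, which this answer does not mention and which needs SQLite 3.34.0 or newer with FTS5 compiled in. The table name docs, the column name body, and the helper substring_search are made up for the example.

```python
import sqlite3


def substring_search(rows, needle):
    """Return the rows that contain `needle`, using FTS5 with the trigram tokenizer.

    Assumes SQLite >= 3.34.0 built with FTS5; older builds raise
    sqlite3.OperationalError when the table is created.
    """
    con = sqlite3.connect(":memory:")
    try:
        # Hypothetical table and column names, chosen for this sketch only.
        con.execute(
            "CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='trigram')"
        )
        con.executemany(
            "INSERT INTO docs(body) VALUES (?)", [(r,) for r in rows]
        )
        # Wrap the needle in double quotes so FTS5 reads it as a plain string
        # rather than as query syntax. Needles shorter than 3 characters
        # cannot be matched through the trigram index.
        quoted = '"' + needle.replace('"', '""') + '"'
        cur = con.execute(
            "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rowid",
            (quoted,),
        )
        return [row[0] for row in cur]
    finally:
        con.close()


if __name__ == "__main__":
    print(substring_search(["substr", "prefix", "obstruct"], "ubst"))
```

Whether this scales to billions of strings is a separate question that would need measuring; a trigram index is noticeably larger than an ordinary word index.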
Upvotes: 0
Reputation: 7866
Sphinx has supported efficient substring searches since version 2.0.1-beta (22 April 2011). Unfortunately, as of today this support exists only in beta versions, as mentioned here.
I tried the 2.1.1 beta version, and it seems to work correctly. See the manual entry for the dictionary type (dict) and read about the keywords type.
When I tried the 2.0.6 release version, it fell back to the inefficient crc index and gave the following warning during indexing:
WARNING: min_infix_len is not supported yet with dict=keywords; using dict=crc
My minimal configuration file:
source sour
{
    type            = xmlpipe2
    xmlpipe_command = type C:\Temp\1\sphinx\input.xml
}

index inde
{
    source        = sour
    path          = testpa
    enable_star   = 1
    dict          = keywords
    charset_type  = utf-8
    min_infix_len = 1
}
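To give a feel for why the crc fallback is costly: as far as I understand the Sphinx manual, dict = crc handles infix search by storing every infix of every keyword as a separate index entry. The Python sketch below only counts those infixes for one word. It is an illustration of the idea, not Sphinx's code, and the function name infixes is made up.

```python
def infixes(word, min_infix_len=1):
    """Return every distinct substring of `word` with at least `min_infix_len` characters."""
    n = len(word)
    return {
        word[i:j]
        for i in range(n)
        for j in range(i + min_infix_len, n + 1)
    }


if __name__ == "__main__":
    # A single 6-letter keyword already expands into many index entries.
    print(len(infixes("substr", 1)))
    print(len(infixes("substr", 3)))
    print("ubst" in infixes("substr", 1))
```

The number of infixes grows roughly with the square of the keyword length, so min_infix_len = 1 is the most expensive setting for an index that stores them all.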
Upvotes: 1
Reputation: 14645
The best index structure for this case is a suffix tree. Lucene does not implement this type of index, so its substring search is slow. However, Lucene does have a prefix-tree index, which means searches are fast when you look up terms by their prefix.
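The link between the two ideas is that a substring of a string is always a prefix of one of its suffixes. A minimal Python sketch of that, using a sorted list of suffixes (a simplified suffix array) rather than a real suffix tree; the function names are made up, and storing whole suffixes like this is far too memory-hungry for billions of strings, where you would keep offsets instead.

```python
from bisect import bisect_left


def build_suffix_index(strings):
    """Return a sorted list of (suffix, string_id) pairs covering every suffix."""
    entries = []
    for sid, s in enumerate(strings):
        for i in range(len(s)):
            entries.append((s[i:], sid))
    entries.sort()
    return entries


def find_substring(index, strings, needle):
    """Return the strings that contain `needle`, in their original order."""
    # All suffixes starting with `needle` sit next to each other in the sorted
    # list, so a binary search finds the first one and a short scan finds the rest.
    pos = bisect_left(index, (needle,))
    ids = set()
    while pos < len(index) and index[pos][0].startswith(needle):
        ids.add(index[pos][1])
        pos += 1
    return [strings[i] for i in sorted(ids)]


if __name__ == "__main__":
    data = ["substr", "prefix", "obstruct"]
    idx = build_suffix_index(data)
    print(find_substring(idx, data, "ubst"))
    print(find_substring(idx, data, "str"))
```

The lookup costs one binary search plus the number of matching suffixes, which is why a prefix-capable index over suffixes gives fast substring search.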
Upvotes: 4