Reputation: 1362
I am new to Lucene, so maybe this is a technical limit I don't understand.
I have indexed some text and then tried to fetch the content.
If I search the text "open-source reciprocal productivity"
with the query "source",
I get a match.
If I use the query "sour",
I also get a match. But if I use the query "sou",
then I don't get any match.
I am using Lucene.NET version 4.8. Here is the code I use to create the index:
using (var dir = FSDirectory.Open(targetDirectory))
{
    Analyzer analyzer = metadata.GetAnalyzer(); // returns new StandardAnalyzer(LuceneVersion.LUCENE_48)
    var indexConfig = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
    using (IndexWriter writer = new IndexWriter(dir, indexConfig))
    {
        long entryNumber = csvRecords.Count();
        long index = 0;
        long lastPercentage = 0;
        foreach (dynamic csvEntry in csvRecords)
        {
            Document doc = new Document();
            IDictionary<string, object> dynamicCsvEntry = (IDictionary<string, object>)csvEntry;
            var indexedMetadataFiled = metadata.IdexedFields;
            foreach (string headField in header)
            {
                // Skip columns that are neither indexed nor stored.
                if (indexedMetadataFiled.ContainsKey(headField) == false || (indexedMetadataFiled[headField].NeedToBeIndexed == false && indexedMetadataFiled[headField].NeedToBeStored == false))
                    continue;
                var field = new Field(headField,
                    ((string)dynamicCsvEntry[headField] ?? string.Empty).ToLower(),
                    indexedMetadataFiled[headField].NeedToBeStored ? Field.Store.YES : Field.Store.NO, //YES
                    indexedMetadataFiled[headField].NeedToBeIndexed ? Field.Index.ANALYZED : Field.Index.NO //YES
                );
                doc.Add(field);
            }
            writer.AddDocument(doc);
            index++;
            // Log progress every 10%.
            long percentage = (long)(((decimal)index / (decimal)entryNumber) * 100m);
            if (percentage > lastPercentage && percentage % 10 == 0)
            {
                _consoleLogger.Information($"..indexing {percentage}%..");
                lastPercentage = percentage;
            }
        }
    }
}
And here is the code I use to query the index:
var tokens = Regex.Split(query.Trim(), @"\W+");
BooleanQuery composedQuery = new BooleanQuery();
foreach (var field in luceneHint.FieldsToSearch)
{
    foreach (string word in tokens)
    {
        if (string.IsNullOrWhiteSpace(word))
            continue;
        // One fuzzy clause per (field, word) pair, boosted by the field weight.
        var termQuery = new FuzzyQuery(new Term(field.FieldName, word.ToLower()));
        termQuery.Boost = (float)field.Weight;
        composedQuery.Add(termQuery, Occur.SHOULD);
    }
}
var indexManager = IndexManager.Instance;
ReferenceManager<IndexSearcher> index = indexManager.Read(boundle);
int resultLimit = luceneHint?.Top ?? RESULT_LIMIT;
var results = new List<JObject>();
var searcher = index.Acquire();
Dictionary<string, FieldDescriptor> filedToRead = (luceneHint?.FieldsToRead?.Any() ?? false) ?
    luceneHint.FieldsToRead.ToDictionary(item => item.FieldName, item => item) :
    new Dictionary<string, FieldDescriptor>();
bool fetchEveryField = filedToRead.Count == 0;
TopScoreDocCollector collector = TopScoreDocCollector.Create(resultLimit, true);
int startPageIndex = pageIndex * itemsPerPage;
searcher.Search(composedQuery, collector);
//TopDocs topDocs = searcher.Search(composedQuery, luceneHint?.Top ?? 100);
TopDocs topDocs = collector.GetTopDocs(startPageIndex, itemsPerPage);
foreach (var scoreDoc in topDocs.ScoreDocs)
{
    Document doc = searcher.Doc(scoreDoc.Doc);
    dynamic result = new JObject();
    foreach (var field in doc.Fields)
    {
        if (fetchEveryField || filedToRead.ContainsKey(field.Name))
            result[field.Name] = field.GetStringValue();
    }
    results.Add(result);
}
if (searcher != null)
    index.Release(searcher);
return results;
I am confused. Is the fact that I can't get a result for the query "sou" related to the StandardAnalyzer used to build the index applying some stop-word list that prevents my query term from being found in the index? (That is, does matching stop at "source" and "sour" because those are both English words?)
PS: here is the Explain output, even though I don't know how to use it:
searcher.Explain(composedQuery,6) {0 = (NON-MATCH) sum of: } Description: "sum of:" IsMatch: false Match: false Value: 0
Upvotes: 0
Views: 79
Reputation: 111
The documentation for FuzzyQuery points out that it uses the default minimumSimilarity value of 0.5:
minimumSimilarity - a value between 0 and 1 to set the required similarity between the query term and the matching terms. For example, for a minimumSimilarity of 0.5 a term of the same length as the query term is considered similar to the query term if the edit distance between both terms is less than length(term) * 0.5
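So it matches "source" when the query is "sour": removing "ce" takes two edits, so the edit distance is 2, which is <= length("sour") * 0.5. Matching "source" to "sou", however, would need 3 edits, so it's not a match. In Lucene 4.x (including Lucene.NET 4.8) the same limit is expressed as the maxEdits parameter of FuzzyQuery, which defaults to 2 and cannot be set higher.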
You should be able to see the same document matching even if you search for something like "bounce" or "sauce", since those are also within two edits from "source".
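To check the edit-distance reasoning above without touching the index, here is a minimal sketch of a plain Levenshtein distance in C#. It is not Lucene's implementation (FuzzyQuery uses an automaton and, by default, also counts a transposition as one edit); the class name `EditDistanceDemo` and the helper `WithinEdits` are made up for illustration, and the default limit of 2 is an assumption mirroring the default maxEdits.

```csharp
using System;

public static class EditDistanceDemo
{
    // Plain Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++)
            prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            // Swap rows so prev always holds the most recently completed row.
            var tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.Length];
    }

    // Hypothetical helper: true when the query term is within maxEdits
    // of the indexed term (2 is assumed here to mirror the default).
    public static bool WithinEdits(string query, string indexedTerm, int maxEdits = 2)
    {
        return Levenshtein(query, indexedTerm) <= maxEdits;
    }
}
```

With this sketch, `EditDistanceDemo.WithinEdits("sour", "source")` is true (distance 2) and `EditDistanceDemo.WithinEdits("sou", "source")` is false (distance 3). "bounce" and "sauce" are each at distance 2 from "source", which is why they would match as well.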
Upvotes: 1