Reputation: 5394
I'm using Lucene.NET 4.8-beta00005.
I have a "name" field in my documents defined as follows:
doc.Add(CreateField(NameField, entry.Name.ToLower()));
writer.AddDocument(doc);
where CreateField is implemented as follows:
private static Field CreateField(string fieldName, string fieldValue)
{
    return new Field(fieldName, fieldValue, new FieldType() {IsIndexed = true, IsStored = true, IsTokenized = true, StoreTermVectors = true, StoreTermVectorPositions = true, StoreTermVectorOffsets = true, StoreTermVectorPayloads = true});
}
The "name" field is assigned a StandardAnalyzer.
Then, in my CustomScoreProvider, I'm retrieving the terms from the term vector as follows:
private List<string> GetDocumentTerms(int doc, string fieldName)
{
    var indexReader = m_context.Reader;
    var termVector = indexReader.GetTermVector(doc, fieldName);
    var termsEnum = termVector.GetIterator(null);
    BytesRef termBytesRef;
    termBytesRef = termsEnum.Next();
    var documentTerms = new List<string>();
    while (termBytesRef != null)
    {
        //removing trailing \0 (padded to 16 bytes)
        var termText = Encoding.Default.GetString(termBytesRef.Bytes).Replace("\0", "");
        documentTerms.Add(termText);
        termBytesRef = termsEnum.Next();
    }
    return documentTerms;
}
Now I have a document where the value of the "name" field is "dan gertler diamonds ltd."
So the terms from the term vector I'm expecting are:
dan gertler diamonds ltd
But my GetDocumentTerms gives me the following terms:
dan diamonds gertlers ltdtlers
I'm using a StandardAnalyzer with the field, so I'm not expecting it to transform the original words much (and I did check this particular name against StandardAnalyzer).
What am I doing wrong here, and how can I fix it?
Edit: As a workaround for now, I'm extracting the terms manually with each field's Analyzer and storing them in a separate String field.
Upvotes: 0
Views: 66
Reputation: 142
A term vector enumerates its terms in sorted order, not in the order they appear in the field. If you want to get the terms in the correct order, you must also use the positional information. Try this code:
Terms terms = indexReader.GetTermVector(doc, fieldName);
if (terms != null)
{
    var termIterator = terms.GetIterator(null);
    BytesRef bytestring;
    var documentTerms = new List<Tuple<int, string>>();
    while ((bytestring = termIterator.Next()) != null)
    {
        var docsAndPositions = termIterator.DocsAndPositions(null, null, DocsAndPositionsFlags.OFFSETS);
        docsAndPositions.NextDoc();
        int position;
        for (int left = docsAndPositions.Freq; left > 0; left--)
        {
            position = docsAndPositions.NextPosition();
            documentTerms.Add(new Tuple<int, string>(position, bytestring.Utf8ToString()));
        }
    }
    documentTerms.Sort((word1, word2) => word1.Item1.CompareTo(word2.Item1));
    foreach (var word in documentTerms)
    {
        Console.WriteLine("{0} {1} {2}", fieldName, word.Item1, word.Item2);
    }
}
This code also handles the situation where you have the same term (word) in more than one place.
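As for the garbled terms in the question ("gertlers", "ltdtlers"): these most likely come from decoding the whole BytesRef.Bytes array instead of only the Offset/Length window that belongs to the current term. The backing array is reused between calls to Next(), so bytes left over from a longer earlier term remain behind a shorter later one. BytesRef.Utf8ToString() decodes only the valid window, which is why the code above does not show the problem. The sketch below reproduces the effect without Lucene.NET. The Slice class and the helper methods are hypothetical stand-ins written for illustration, and the 16-byte reused buffer is an assumption taken from the comment in the question, not a statement about Lucene.NET internals.

```csharp
using System;
using System.Text;

public static class SharedBufferDemo
{
    // Hypothetical stand-in for a BytesRef-like slice:
    // a reusable backing array plus the offset and length of the valid bytes.
    public sealed class Slice
    {
        public byte[] Bytes = new byte[16];
        public int Offset;
        public int Length;
    }

    // Overwrites the start of the shared buffer with the new term.
    // Bytes beyond the new term's length are left as they were.
    public static void Fill(Slice slice, string term)
    {
        byte[] encoded = Encoding.UTF8.GetBytes(term);
        Array.Copy(encoded, 0, slice.Bytes, 0, encoded.Length);
        slice.Offset = 0;
        slice.Length = encoded.Length;
    }

    // Decodes the entire backing array and strips NULs, as the question's code does.
    public static string DecodeWholeBuffer(Slice slice)
    {
        return Encoding.UTF8.GetString(slice.Bytes).Replace("\0", "");
    }

    // Decodes only the valid window, mirroring what Utf8ToString() does.
    public static string DecodeWindow(Slice slice)
    {
        return Encoding.UTF8.GetString(slice.Bytes, slice.Offset, slice.Length);
    }

    public static void Main()
    {
        var slice = new Slice();
        // Terms in sorted order, as a term vector would enumerate them.
        foreach (var term in new[] { "dan", "diamonds", "gertler", "ltd" })
        {
            Fill(slice, term);
            Console.WriteLine("whole buffer: {0}, window: {1}",
                DecodeWholeBuffer(slice), DecodeWindow(slice));
        }
    }
}
```

In this sketch the whole-buffer decode yields dan, diamonds, gertlers, ltdtlers, matching the output reported in the question, while the windowed decode yields the four terms unchanged. In the original GetDocumentTerms, replacing the Encoding.Default.GetString(...) call with termBytesRef.Utf8ToString() should therefore be enough to get clean term text.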
Upvotes: 1