user949300

Reputation: 15729

Lucene 4.1 is ignoring FieldType.tokenized = false

I'm using Lucene 4.1 to index keyword/value pairs, where the keywords and values are not real words - i.e., they are voltages, settings, etc. that should not be analyzed or tokenized, e.g. $P14R / 16777216. (This is FCS data, for any flow cytometrists out there.)

For indexing, I create a FieldType with indexed = true, stored = true, and tokenized = false. These mimic the ancient Field.Keyword from Lucene 1, for which I have the book. :-) I even freeze the fieldType.
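(For comparison, Lucene 4.x's built-in StringField is essentially the modern Field.Keyword: it indexes the whole value as a single untokenized term. A minimal sketch, using an illustrative name/value from the data above:)

// StringField: indexed = true, tokenized = false; Field.Store.YES also stores the value
Document doc = new Document();
doc.add(new StringField("$P14R", "16777216", Field.Store.YES));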

I see these values in the debugger. I create the document and index.

When I read the index and document and look at the Fields in the debugger, I see all my fields. The names and fieldsData look correct. However, the FieldType is wrong. It shows indexed = true, stored = true, and tokenized = true. The result is that my searches (using a TermQuery) do not work.

How can I fix this? Thanks.

p.s. I am using a KeywordAnalyzer in the IndexWriterConfig. I'll try to post some demo code later, but it's off to my real job for today. :-)

DEMO CODE:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneDemo {

  public static void main(String[] args) throws IOException {
    Directory lDir = new RAMDirectory();
    Analyzer analyzer = new KeywordAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(lDir, iwc);

    // BTW, Lucene, anyway you could make this even more tedious???
    // ever heard of builders, Enums, or even old fashioned bits?
    FieldType keywordFieldType = new FieldType();
    keywordFieldType.setStored(true);
    keywordFieldType.setIndexed(true);
    keywordFieldType.setTokenized(false);
    keywordFieldType.freeze();  // freeze the type, as mentioned above

    Document doc = new Document();
    doc.add(new Field("$foo",  "$bar123", keywordFieldType));
    doc.add(new Field("contents",  "$foo=$bar123", keywordFieldType));
    doc.add(new Field("$foo2",  "$bar12345", keywordFieldType));

    Field onCreation = new Field("contents",  "$foo2=$bar12345", keywordFieldType);
    doc.add(onCreation);
    System.out.println("When creating, the field's tokenized is " + onCreation.fieldType().tokenized());
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = DirectoryReader.open(lDir);
    Document d1 = reader.document(0);
    Field readBackField = (Field) d1.getFields().get(0);
    System.out.println("When read back the field's tokenized is " + readBackField.fieldType().tokenized());

    IndexSearcher searcher = new IndexSearcher(reader);

    // exact match works
    Term term = new Term("$foo", "$bar123");
    Query query = new TermQuery(term);
    TopDocs results = searcher.search(query, 10);
    System.out.println("when searching for : " + query.toString() + "  hits = " + results.totalHits);

    // partial match fails
    term = new Term("$foo", "123" );
    query = new TermQuery(term);
    results = searcher.search(query, 10);
    System.out.println("when searching for : " + query.toString() + "  hits = " + results.totalHits);

    // wildcard search works
    term = new Term("contents", "*$bar12345" );
    query = new WildcardQuery(term);
    results = searcher.search(query, 10);
    System.out.println("when searching for : " + query.toString() + "  hits = " + results.totalHits);
    reader.close();
  }
}

The output will be:

When creating, the field's tokenized is false
When read back the field's tokenized is true
when searching for : $foo:$bar123  hits = 1
when searching for : $foo:123  hits = 0
when searching for : contents:*$bar12345  hits = 1

Upvotes: 1

Views: 1509

Answers (3)

Lucene stores all tokens in lower case - hence, you need to convert your search strings to lower case first for non-tokenized fields.
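If that is what's happening (note: KeywordAnalyzer by itself does not lower-case, so this applies when a lower-casing analyzer was used at index time), a sketch of the search side, with a hypothetical userInput variable holding the raw search string:

// Lower-case the raw input before building the term, assuming the
// index-time analyzer lower-cased the tokens.
String userInput = "$Bar123";
Term term = new Term("$foo", userInput.toLowerCase(Locale.ROOT));
Query query = new TermQuery(term);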

Upvotes: 1

user949300

Reputation: 15729

The demo code proves that the value of tokenized is different when you read it back. I'm not sure whether that is a bug or not.

But that isn't why the partial search doesn't work. The partial search fails because Lucene doesn't do partial matches on terms (unless you use a WildcardQuery), as explained elsewhere on Stack Overflow. For example:
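The failing partial search from the demo can be expressed as a wildcard query instead (a sketch; note that leading wildcards can be slow on large indexes):

// Partial match via WildcardQuery instead of TermQuery:
// "*123" matches "$bar123", so this returns 1 hit.
Query partial = new WildcardQuery(new Term("$foo", "*123"));
TopDocs hits = searcher.search(partial, 10);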

Been using Google so long I guess I didn't understand that. :-)

Upvotes: 0

Emanuele Bezzi

Reputation: 1728

You can try to use a KeywordAnalyzer for the fields you don't want to tokenize.

If you need multiple analyzers (that is, if you have other fields that need tokenization), PerFieldAnalyzerWrapper is the way to go, as sketched below.
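A minimal sketch (the field names are just illustrative, borrowed from the demo):

// KeywordAnalyzer for the untokenized keyword/value fields,
// StandardAnalyzer (or any tokenizing analyzer) as the default.
Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
perField.put("$foo", new KeywordAnalyzer());
perField.put("$foo2", new KeywordAnalyzer());
Analyzer analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Version.LUCENE_41), perField);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_41, analyzer);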

Upvotes: 1
