Reputation: 47
I want to migrate an example from the book "Lucene in Action 2nd Edition", which is based on Lucene 3.0, to Lucene's current version. Here is the code that needs to be migrated:
public void testUpdate() throws IOException {
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
I'm trying to perform the migration according to the Lucene Migration Guide using the equivalents for the former Field constructors to create the Document object. The code for this looks as follows:
@Test
public void testUpdate() throws IOException
{
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setOmitNorms(false);
doc.add(new Field("id", "1", ft));
doc.add(new StoredField("country", "Netherlands"));
doc.add(new TextField("contents", "Den Haag has a lot of museums", Store.NO));
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
The final assertion fails because the search does not find the string "Den Haag" (searching for only "Den" or only "Haag" does work). If I use a StringField object instead, the test passes, since the "city" field is then not analyzed (i.e. not tokenized) and is therefore indexed unchanged. But the example obviously does not intend to treat this field like, say, an ID. I've read that the combination "Field.Store.YES / Field.Index.ANALYZED" is good for small textual content such as an intro text, abstract or title, so it should also match multi-word strings like "Den Haag", or am I wrong? Could anyone clarify, please?
The author uses a Term object to create the search string:
protected int getHitCount(String fieldName, String searchString) throws IOException {
DirectoryReader dr = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(dr);
Term t = new Term(fieldName, searchString);
Query query = new TermQuery(t);
int hitCount = TestUtil.hitCount(searcher, query);
return hitCount;
}
The TestUtil method contains only a single line of code:
public static int hitCount(IndexSearcher searcher, Query query) throws IOException {
return searcher.search(query, 1).totalHits;
}
Upvotes: 0
Reputation: 26713
Short explanation: you need to make sure the tokenization setting (on/off) is the same at index time and at search time.
Long explanation: if you want your content to be analyzed, you should use not only TextField but also QueryParser, so that your query goes through the same analysis process. In your case the query fails because with
new Field("city", "Den Haag", TextField.TYPE_STORED)
the text gets tokenized into "Den" and "Haag". Later, when you create a TermQuery, you search for the single term "Den Haag", which of course yields no results.
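To make the mismatch concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene: the class TokenizationSketch and its methods are hypothetical names made up for illustration, and splitting on whitespace is only an assumption about what the analyzer does. It shows that an exact-term lookup for "Den Haag" misses a tokenized value, while it hits when the query is split the same way or when the value was indexed as a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of index-time vs. search-time tokenization. Not Lucene; all names are hypothetical. */
public class TokenizationSketch {

    /** Each indexed value is stored as the list of terms it was split into. */
    private final List<List<String>> docs = new ArrayList<>();

    /** tokenized = true splits on whitespace (assumed analyzer); false keeps the value as one term. */
    public void add(String value, boolean tokenized) {
        docs.add(tokenized
                ? Arrays.asList(value.split("\\s+"))
                : Collections.singletonList(value));
    }

    /** Counts values containing exactly this term, the way a single-term query would. */
    public int termHits(String term) {
        int hits = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    /** Splits the query like the index does and requires the terms to be adjacent and in order. */
    public int phraseHits(String query) {
        List<String> queryTerms = Arrays.asList(query.split("\\s+"));
        int hits = 0;
        for (List<String> doc : docs) {
            if (Collections.indexOfSubList(doc, queryTerms) >= 0) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TokenizationSketch analyzed = new TokenizationSketch();
        analyzed.add("Den Haag", true);                       // indexed as "Den", "Haag"
        System.out.println(analyzed.termHits("Den Haag"));    // 0: no such single term
        System.out.println(analyzed.termHits("Haag"));        // 1
        System.out.println(analyzed.phraseHits("Den Haag"));  // 1: query split the same way

        TokenizationSketch verbatim = new TokenizationSketch();
        verbatim.add("Den Haag", false);                      // indexed as one term
        System.out.println(verbatim.termHits("Den Haag"));    // 1
    }
}
```

In Lucene terms, termHits plays the role of a TermQuery and phraseHits roughly that of a query whose text went through the same analysis as the indexed field; the real classes do considerably more than this sketch.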
The code below shows how this could work for the non-tokenized case:
doc.add(new StringField("city", "Den Haag", Field.Store.YES));
...
PhraseQuery query = new PhraseQuery();
query.add(new Term("city", "Den Haag"));
Upvotes: 1