denic

Reputation: 31

IntPoint is not indexing integer values

We are trying to index integer values with the field type "IntPoint", but the values do not seem to be written correctly to our Lucene index.

We are working with Lucene 6.0.

According to the Lucene documentation, the code snippet:

        doc.add(new IntPoint(LENGTH2, 17));

should add the LENGTH2 document field with the value "17" to our indexable document. Unfortunately, there are no values in our index field LENGTH2.

We also tried this with the deprecated field type "LegacyIntField". With this type we got some cryptic symbols in our index field, like:

        950 length  h(5j
        950 length  pT
        950 length  xj
        950 length  `PkU2

For this type we used the following code:

        LegacyIntField intField = new LegacyIntField(LENGTH,0,Field.Store.NO);
        intField.setIntValue(17);
        doc.add(intField);

Do you know a solution for this problem?

Addition: Do you have a working example that includes both indexing and searching with IntPoint? We tried it out, but it does not seem to work: Java Code for Lucene IntPoint Indexing

We also tried searching on this field, but the results did not match what we expected.

   QueryParser parser = new MultiFieldQueryParser(fields,analyzer);
   Query query = parser.parse("content:" + queryString);

   Query queryNumeric = IntPoint.newRangeQuery(Indexer.LENGTH2, 0, 5);
   Builder builder = new Builder();
   builder.add(query, Occur.MUST);
   builder.add(queryNumeric, Occur.MUST);
   BooleanQuery booleanQ = builder.build();

   TopDocs hits2 = is.search(booleanQ, 1000);
   System.out.print("short: " + hits2.totalHits);

Upvotes: 3

Views: 3248

Answers (2)

mike rodent

Reputation: 15682

This Lucene 6+ class IntPoint (and related classes) seems to be rather poorly documented.

From my experiments it would appear that this is NOT a replacement for IntField, and that the latter was always a "fiction" and has simply been dropped.

It appears that most people who want to store an Integer field in the Lucene Documents that make up an index should realise that they will in fact be storing a text representation of this Integer. Obviously the same applied in earlier versions of Lucene, with IntField: index files consist of bytes of text*.

For this reason, for simple storage of Integers, you should just use a single field, of class StoredField or class StringField, and not include an IntPoint field at all.

Recap

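  • TextField is tokenised and indexed, and is stored only if you pass Field.Store.YES.
  • StringField is indexed as a single token (not tokenised), and is stored only if you pass Field.Store.YES.
  • StoredField is just stored.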

Clearly tokenising a number is nonsensical. Whether you use StringField or StoredField then depends on whether you need to index your number, or whether it is just an item of data contained in the Document which will be needed when the Document is retrieved. Note that all these classes are subclasses of class Field, and inherit all its methods, including setIntValue. All, despite the names TextField and StringField, are in fact therefore "polyvalent".

The proof that all this stored stuff is really just text is found when you do a search and get your Documents back: when you do

IndexableField myRetrievedField = myRetrievedLDocument.getField(intFieldName);

what you get back is a StoredField. The same is true if you store a TextField: whatever Field subclass you used to do the storing, what is stored is a StoredField! And because a StoredField is indeed "polyvalent", you can then call either

myRetrievedField.numericValue()

or

myRetrievedField.stringValue()

The result from numericValue() is of class java.lang.Number, but it is documented as

"Non-null if this field has a numeric value"

... i.e. StoredField obviously parses this returned text and tries to make it into a Number. If the text is not parsable as such, you get back null.

Presumably the stored number is always stored in "Anglo-Saxon" format: i.e. with a dot for the decimal point (unlike French, which uses a comma). If Lucene implemented some sort of "locale" setting in this regard, it could certainly impair the compatibility/readability of retrieved Documents. French (and other) people who are dissatisfied with this cultural hegemony (hégémonie culturelle), if such it be, always have the option of storing numbers as text with "decimal commas" etc., then using stringValue instead of numericValue and parsing with their own parser!

But the main point is just that IntPoint has nothing to do with storing Integers.


* It may be that in some Lucene implementations, past, present or future, numbers are stored in a more compact way than just spelling them out as (one-byte UTF-8) Strings. This is because the number of possible characters is less than 256: less than 16, in fact: the digits 0 to 9, - (minus), ., space and "E" for exponential. So in fact numbers could be stored in half the space of the equivalent String. Whether Lucene currently bothers saving space in this way I have no idea, but it could presumably be found out by calling stringValue on a retrieved Field which had been stored using setIntValue.
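
The "half the space" arithmetic in this footnote can be sketched in plain Java. This only illustrates the idea that 14 symbols fit into 4 bits each, so two characters fit into one byte. It makes no claim about how Lucene actually stores anything, and NibblePacker, pack and unpack are hypothetical names invented for this sketch.

```java
/** Illustration only: packs number-like text into 4-bit codes, two per byte. */
final class NibblePacker {
    // The 14 symbols listed in the footnote; a symbol's index is its 4-bit code.
    private static final String ALPHABET = "0123456789-. E";
    // Code 15 is not used by any symbol, so it marks padding for odd lengths.
    private static final int PAD = 15;

    static byte[] pack(String number) {
        byte[] out = new byte[(number.length() + 1) / 2];
        for (int i = 0; i < number.length(); i++) {
            int code = ALPHABET.indexOf(number.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("unsupported character: " + number.charAt(i));
            }
            // Even positions go into the high nibble, odd positions into the low nibble.
            int shift = (i % 2 == 0) ? 4 : 0;
            out[i / 2] |= (byte) (code << shift);
        }
        if (number.length() % 2 == 1) {
            out[out.length - 1] |= PAD; // fill the unused low nibble
        }
        return out;
    }

    static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : packed) {
            int hi = (b >> 4) & 0x0F;
            int lo = b & 0x0F;
            sb.append(ALPHABET.charAt(hi));
            if (lo != PAD) {
                sb.append(ALPHABET.charAt(lo));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "-12.5E3";
        byte[] packed = pack(text);
        System.out.println(text.length() + " characters packed into " + packed.length + " bytes");
        System.out.println(unpack(packed));
    }
}
```

A seven-character value such as "-12.5E3" needs four bytes in this scheme instead of seven as one-byte UTF-8 text.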

Upvotes: 3

Khurram Shehzad

Reputation: 51

Besides indexing the field as:

doc.add(new IntPoint(LENGTH2, 17));

you also need to store the field separately, by adding a separate instance of StoredField:

doc.add(new StoredField(LENGTH2, 17));

According to the documentation of IntPoint:

An indexed int field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.

Reference: https://lucene.apache.org/core/6_1_0/core/org/apache/lucene/document/IntPoint.html
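
To illustrate why an IntPoint on its own does not show up as a readable value in the index: as far as I understand it, Lucene 6 encodes each int point as four sortable bytes (sign bit flipped, then big-endian) and keeps them in a separate points structure built for range queries, not as text terms. Below is a plain-Java sketch of such an encoding, with no Lucene dependency. The class and method names (SortableIntBytes, encode, decode, compareUnsigned) are made up for this illustration and are not Lucene API.

```java
import java.util.Arrays;

/** Plain-Java sketch of a sortable 4-byte int encoding (not Lucene code). */
final class SortableIntBytes {

    /** Flips the sign bit, then writes the int in big-endian byte order. */
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Reverses encode(): reassembles the int and flips the sign bit back. */
    static int decode(byte[] b) {
        int flipped = ((b[0] & 0xFF) << 24)
                    | ((b[1] & 0xFF) << 16)
                    | ((b[2] & 0xFF) << 8)
                    |  (b[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    /** Compares two equal-length arrays byte by byte, treating bytes as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] seventeen = encode(17);
        // Raw bytes, not the characters '1' and '7'.
        System.out.println(Arrays.toString(seventeen));
        System.out.println(decode(seventeen));
        // Byte order matches numeric order, which is what a range filter needs.
        System.out.println(compareUnsigned(encode(-5), encode(17)) < 0);
    }
}
```

Because the encoded bytes exist only to answer range queries, reading the original value back from a search hit still requires the separate StoredField shown above.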

Upvotes: 4
