Reputation: 31
When we index integer values with the field type "IntPoint", the values do not seem to be written correctly to our Lucene index.
We are working with Lucene 6.0.
According to the Lucene documentation, the code snippet:
doc.add(new IntPoint(LENGTH2, 17));
should add the LENGTH2 document field with the value "17" to our indexable document. Unfortunately, there are no values in our index field LENGTH2.
We also tried this with the deprecated field type "LegacyIntField". With this type we got some cryptic symbols in our index field, like:
950 length h(5j
950 length pT
950 length xj
950 length `PkU2
For this type we used the following code:
LegacyIntField intField = new LegacyIntField(LENGTH,0,Field.Store.NO);
intField.setIntValue(17);
doc.add(intField);
Do you know a solution for this problem?
Addition: Do you have a working example that includes both indexing and searching of IntPoint? We tried it, but it does not seem to work. Java Code for Lucene IntPoint Indexing
We also tried searching on this field, but the results did not match what we expected.
QueryParser parser = new MultiFieldQueryParser(fields,analyzer);
Query query = parser.parse("content:" + queryString);
Query queryNumeric = IntPoint.newRangeQuery(Indexer.LENGTH2, 0, 5);
Builder builder = new Builder();
builder.add(query, Occur.MUST);
builder.add(queryNumeric, Occur.MUST);
BooleanQuery booleanQ = builder.build();
TopDocs hits2 = is.search(booleanQ, 1000);
System.out.print("short: " + hits2.totalHits);
Upvotes: 3
Views: 3248
Reputation: 15682
This Lucene 6+ class IntPoint (and related classes) seems to be rather poorly documented.
From my experiments it would appear that this is NOT a replacement for IntField, and that the latter always was a "fiction" and has just been dumped.
It appears that most people who want to store an Integer field in the Lucene Documents that make up an index should realise that they will in fact be storing a text representation of this Integer. Obviously the same applied with earlier versions of Lucene, with IntField: index files consist of bytes of text*.
For this reason, for simple storage of Integers, you should just use a single field, of class StoredField or class StringField, and not include an IntPoint field at all.
Recap
TextField is tokenised, indexed and stored.
StringField is indexed and stored.
StoredField is just stored.
Clearly tokenising a number is nonsensical. Whether you use StringField or StoredField then depends on whether you need to index your number, or whether it is just an item of data contained in the Document which will be needed when the Document is retrieved. Note that all these classes are subclasses of class Field, and inherit all its methods, including setIntValue. All, despite the names TextField and StringField, are in fact therefore "polyvalent".
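The recap above can be checked against the field types themselves. The following is a minimal sketch, assuming the Lucene 6.x `document` API; the class name `FieldTypeRecap`, the helper `describe` and the field names are made up for illustration. Note that for TextField and StringField the "stored" part depends on passing `Field.Store.YES`; with `Field.Store.NO` the value is indexed only.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexableFieldType;

// Hypothetical helper class, not part of Lucene.
class FieldTypeRecap {

    /** Summarises the indexed / tokenised / stored flags of one field. */
    static String describe(Field f) {
        IndexableFieldType t = f.fieldType();
        boolean indexed = t.indexOptions() != IndexOptions.NONE;
        // The tokenised flag is only meaningful for fields that are indexed.
        boolean tokenized = indexed && t.tokenized();
        return "indexed=" + indexed + " tokenized=" + tokenized + " stored=" + t.stored();
    }

    public static void main(String[] args) {
        System.out.println("TextField:   " + describe(new TextField("body", "some text", Field.Store.YES)));
        System.out.println("StringField: " + describe(new StringField("length", "17", Field.Store.YES)));
        System.out.println("StoredField: " + describe(new StoredField("length", 17)));
    }
}
```

Swapping `Field.Store.YES` for `Field.Store.NO` in the first two calls is a quick way to see that storage is a per-field choice rather than a property of the class.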
The proof that all this stored stuff is really just text is found when you do a search and get your Documents back: when you do
Field myRetrievedField = myRetrievedLDocument.getField( intFieldName )
what you get back is a StoredField ... this is the same if you store a TextField: i.e. whatever the Field subclass you used to do the storing, what is stored is a StoredField! And because a StoredField is indeed "polyvalent" you can then either go
myRetrievedField.numericValue()
or
myRetrievedField.stringValue()
The result from numericValue() is class java.lang.Number ... but said to be
Non-null if this field has a numeric value
... i.e. StoredField obviously parses this returned text and tries to make it into a Number. If the text is not parsable as such you get back null.
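Whether the retrieved value is parsed from text or comes back with the type it was stored with can be tested directly. Below is a minimal sketch, assuming Lucene 6.x with `RAMDirectory` and the `StandardAnalyzer` from the analyzers module; the class name `StoredValueRoundTrip` and the field names `asInt` and `asText` are invented for this example. It stores the same number once as an int and once as a string, then prints what `numericValue()` and `stringValue()` return for each. If `numericValue()` turns out to be non-null only for the field added as a number, that would suggest the type is recorded at index time rather than recovered by parsing, so it is worth running against your own Lucene version.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical class for illustration only.
class StoredValueRoundTrip {

    /** Writes one document to an in-memory index and reads it back. */
    static Document roundTrip() throws Exception {
        Directory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StoredField("asInt", 17));                      // stored as a number
            doc.add(new StringField("asText", "17", Field.Store.YES)); // stored as a string
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.document(0); // the only document in the index
        }
    }

    public static void main(String[] args) throws Exception {
        Document retrieved = roundTrip();
        for (String name : new String[] {"asInt", "asText"}) {
            IndexableField f = retrieved.getField(name);
            System.out.println(name + ": numericValue=" + f.numericValue()
                    + ", stringValue=" + f.stringValue());
        }
    }
}
```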
Presumably the stored number is always stored in "Anglo-Saxon" format: i.e. with a dot for the decimal point (unlike French, which uses a comma). If Lucene implemented some sort of "locale" setting in this regard it could certainly impair the compatibility/readability of retrieved Documents. French (and other) people who are dissatisfied with this cultural hegemony (hégémonie culturelle), if such it be, always have the option of storing numbers as text with "decimal commas" etc. and then using stringValue instead of numericValue and parsing with their own parser!
But the main point is just that IntPoint has nothing to do with storing Integers.
* It may be that in some Lucene implementations, past, present or future, numbers are stored in a more compact way than just spelling them out using (one-byte UTF-8) Strings. This is because the number of possible characters is less than 256: less than 16, in fact: digits 0 to 9, - (minus), ., space and "E" for exponential. So in fact numbers could be stored in half the space of the equivalent String. Whether Lucene currently bothers saving space in this way I have no idea, but it could presumably be found out by calling stringValue on a retrieved Field which had been stored using setIntValue.
Upvotes: 3
Reputation: 51
Besides indexing the field with:
doc.add(new IntPoint(LENGTH2, 17));
you also need to store the value separately by adding a separate instance of StoredField:
doc.add(new StoredField(LENGTH2,17));
According to the documentation of IntPoint:
An indexed int field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.
Reference: https://lucene.apache.org/core/6_1_0/core/org/apache/lucene/document/IntPoint.html
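Putting the two calls together, here is a minimal end-to-end sketch of indexing and searching an IntPoint. It assumes Lucene 6.x (`RAMDirectory`, `StandardAnalyzer` from the analyzers module); the class name `IntPointSearchSketch`, the field names and the sample documents are invented for illustration and are not taken from the question's code. Each document gets an IntPoint for range matching plus a StoredField of the same name so the value can be read back from the hits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical class for illustration only.
class IntPointSearchSketch {
    static final String CONTENT = "content";
    static final String LENGTH = "length";

    static void addDoc(IndexWriter writer, String text, int length) throws Exception {
        Document doc = new Document();
        doc.add(new TextField(CONTENT, text, Field.Store.NO));
        doc.add(new IntPoint(LENGTH, length));    // indexed for range/exact queries, not stored
        doc.add(new StoredField(LENGTH, length)); // stored so the value can be retrieved
        writer.addDocument(doc);
    }

    /** Returns the stored lengths of documents containing "hello" with length in [0, 20]. */
    static List<Integer> run() throws Exception {
        Directory dir = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            addDoc(writer, "hello world", 3);
            addDoc(writer, "hello lucene", 17);
            addDoc(writer, "hello again", 42);
            addDoc(writer, "goodbye", 5);
        }
        List<Integer> lengths = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term(CONTENT, "hello")), BooleanClause.Occur.MUST)
                    .add(IntPoint.newRangeQuery(LENGTH, 0, 20), BooleanClause.Occur.MUST) // bounds are inclusive
                    .build();
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                // Only the StoredField comes back; the IntPoint itself is not retrievable.
                lengths.add(searcher.doc(hit.doc).getField(LENGTH).numericValue().intValue());
            }
        }
        Collections.sort(lengths);
        return lengths;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("matching lengths: " + run());
    }
}
```

Index browsers that list terms per field will typically show nothing for an IntPoint field, because points are kept in a separate structure from the term dictionary; an empty field in such a tool does not by itself mean the value was not indexed.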
Upvotes: 4