xpages-noob

Reputation: 1579

How to get DocValue by document ID in Lucene 7+?

I'm adding a DocValue to a document with

doc.add(new BinaryDocValuesField("foo",new BytesRef("bar")));

To retrieve that value for a specific document with ID docId, I call

DocValues.getBinary(reader,"foo").get(docId).utf8ToString();

The get function in BinaryDocValues is supported up to Lucene 6.6, but for Lucene 7.0 and up it does not seem to be available anymore.

So, how do I get the DocValue by document ID in Lucene 7+, without having to iterate over BinaryDocValues / DocIdSetIterator and without having to re-get BinaryDocValues and call advanceExact every time?

Upvotes: 2

Views: 2686

Answers (1)

Ivan Mamontov

Reputation: 2924

Theory

Doc values are Lucene's column-stride field value storage. They were originally designed for fast random access at query time, mainly for faceting and sorting. LUCENE-7407 switched the access pattern from random access to an iterator. Because an iterator API is a much more restrictive access pattern than an arbitrary random-access API, this change gives Lucene more freedom to use aggressive compression and other optimizations:

  • reduced disk space usage for sparse data
  • better compression ratio and faster decoding of doc values, even in the non-sparse case
  • removal of the special column for missing values (getDocsWithField) and of thread-local codec readers
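
To make the trade-off concrete, here is a toy model of a forward-only column. It is not Lucene's codec, only a sketch under my own assumptions: the class name SparseColumnSketch and its layout are hypothetical. Doc IDs are stored as gaps from the previous ID, which is compact for sparse data but can only be decoded by walking forward, so the reader has to reject any target smaller than the previous one.

```java
/** Toy forward-only column (hypothetical, NOT Lucene's implementation). */
final class SparseColumnSketch {
    private final int[] gaps;       // gap between consecutive doc IDs that have a value
    private final String[] values;  // values[i] belongs to the i-th doc ID that has a value
    private int index = -1;         // position in gaps/values, only ever moves forward
    private int doc = -1;           // doc ID at index, decoded by summing the gaps so far
    private int lastTarget = -1;    // last requested target, used to reject backward moves

    SparseColumnSketch(int[] sortedDocIds, String[] values) {
        this.gaps = new int[sortedDocIds.length];
        this.values = values;
        int prev = -1;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - prev; // store only the distance to the previous ID
            prev = sortedDocIds[i];
        }
    }

    /** Moves forward to target (non-negative) and reports whether that document has a value. */
    boolean advanceExact(int target) {
        if (target < lastTarget) {
            throw new IllegalArgumentException("targets must not decrease: " + target);
        }
        lastTarget = target;
        while (doc < target && index + 1 < gaps.length) {
            index++;
            doc += gaps[index]; // decoding an ID needs every gap before it
        }
        return doc == target;
    }

    /** Value of the current document; only valid right after advanceExact returned true. */
    String value() {
        return values[index];
    }

    public static void main(String[] args) {
        SparseColumnSketch col =
            new SparseColumnSketch(new int[] {1, 4, 7}, new String[] {"a", "b", "c"});
        System.out.println(col.advanceExact(4) ? col.value() : "no value for doc 4");
        System.out.println(col.advanceExact(5) ? col.value() : "no value for doc 5");
    }
}
```

A random-access get(docId) over this layout would have to re-sum the gaps from the start on every call, or keep extra index structures around. That is the kind of cost the iterator contract lets the codec avoid.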

You can read about this change in the following blogs:

Practice

In practice this change causes performance degradation in some cases, for example SOLR-9599. For the main use cases (faceting and sorting) an iterator API is fine when used properly, and it even enables some optimizations. However, there are many cases where this API is not a good fit. All of these cases were dismissed as incorrect usage (the same problem we had in the Java world with sun.misc.Unsafe).

That said, org.apache.lucene.index.DocValuesIterator#advanceExact is quite fast, and for some implementations its performance and complexity are similar to those of the old random-access lookup.
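
For completeness, a minimal sketch of the Lucene 7+ replacement for the get(docId) call from the question. The class and method names (BinaryDocValueLookup, lookup, lookupAll) are mine, not part of Lucene. I am also assuming that reader is a LeafReader and that docId is local to that segment. If you start from a top-level doc ID, you first have to resolve the leaf (for example with ReaderUtil.subIndex) and subtract its docBase.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;

/** Hypothetical helper; assumes segment-local doc IDs. */
final class BinaryDocValueLookup {

    /** Value of one document, or null if the document has no value for the field. */
    static String lookup(LeafReader reader, String field, int docId) throws IOException {
        // Each call to getBinary returns a fresh iterator positioned before the first document.
        BinaryDocValues values = DocValues.getBinary(reader, field);
        return values.advanceExact(docId) ? values.binaryValue().utf8ToString() : null;
    }

    /** Values for several documents; sortedDocIds must be ascending so one iterator can be reused. */
    static List<String> lookupAll(LeafReader reader, String field, int[] sortedDocIds)
            throws IOException {
        BinaryDocValues values = DocValues.getBinary(reader, field);
        List<String> result = new ArrayList<>();
        for (int docId : sortedDocIds) {
            // advanceExact only moves forward, hence the ascending-order requirement.
            result.add(values.advanceExact(docId) ? values.binaryValue().utf8ToString() : null);
        }
        return result;
    }
}
```

With the field from the question, lookup(reader, "foo", docId) should give back "bar" for that document. If you need values for many documents, sort the IDs and use the lookupAll variant so that the iterator is fetched only once per segment.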

Upvotes: 10
