Reputation: 3854

Show contents of Lucene index

I am trying to debug indexing documents in Lucene, and I need to see the contents of the index so I can see exactly how the documents got indexed. Allegedly Luke does this, but there is no documentation for it whatsoever, and when I point it at the index directory (at any of them, although I don't know why it can't figure out which one is right on its own), I get nothing. Surely there is some simple way to do this?

Upvotes: 21

Answers (4)

femtoRgon

Reputation: 33341

Luke IS the simple way to do it. You run it, browse to the index, and are off to the races. Couldn't be easier.

There are other tools out there. LIMO, for example, is also a nice tool for this, but it is harder to get started with than Luke.

Perhaps if you give some details on the problem you are running into with Luke, you will be able to get some help with that.

Upvotes: 15

Kewl_guy89

Reputation: 93

It is possible to compile Luke from source and add the ElasticSearch postings formats to Luke's META-INF/services.

Just follow this approach

Using Luke with ElasticSearch

The same approach can also be followed to test custom postings formats/codecs with Lucene.

ElasticSearch uses a custom postings format (the postings format defines how the inverted index is represented in memory / on disk), and Luke doesn’t know about it. To tell Luke about the ES postings format, add the SPI class by following the steps below.

1. Clone the Luke source repository:

2. Add a dependency on your required version of ElasticSearch to the Luke project’s pom file:

<!-- ElasticSearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.1.1</version>
</dependency>

3. Compile the Luke jar file (creates target/luke-with-deps.jar):

$ mvn package

4. Unpack Luke’s list of known postings formats to a temporary file:

$ unzip target/luke-with-deps.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat -d ./tmp/
Archive:  target/luke-with-deps.jar
  inflating: ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

5. Add the ElasticSearch postings formats to the temp file:

$ echo "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

6. Repack the modified file back into the jar:

$ jar -uf target/luke-with-deps.jar -C tmp/ META-INF/services/org.apache.lucene.codecs.PostingsFormat

7. Run Luke:

$ ./luke.sh
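
If Luke still fails to open the index after the repack step, it may be worth checking that the jar really lists the added postings formats. The following is only a sketch using the standard java.util.jar API; the class name ServicesFileCheck and the method readProviders are made up for illustration and are not part of Luke, Lucene, or ElasticSearch. It prints every provider class named in the services file inside the jar, skipping blank lines and '#' comments.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.zip.ZipEntry;

// Hypothetical helper, not part of Luke: lists the providers registered in a jar's services file.
public class ServicesFileCheck {

    // Returns the class names listed in META-INF/services/<serviceName> inside the given jar.
    static List<String> readProviders(Path jar, String serviceName) throws IOException {
        List<String> providers = new ArrayList<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            ZipEntry entry = jarFile.getEntry("META-INF/services/" + serviceName);
            if (entry == null) {
                return providers; // the jar has no services file for this interface
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jarFile.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Drop '#' comments and surrounding whitespace, then keep non-empty names.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        providers.add(line);
                    }
                }
            }
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the jar path used in the steps above; pass another path as the first argument.
        Path jar = Paths.get(args.length > 0 ? args[0] : "target/luke-with-deps.jar");
        for (String name : readProviders(jar, "org.apache.lucene.codecs.PostingsFormat")) {
            System.out.println(name);
        }
    }
}
```

Run it from the Luke project directory after repacking. The three org.elasticsearch class names should appear in the output alongside the formats Luke already knew about.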

Upvotes: 1

Mark Leighton Fisher

Reputation: 5693

Luke tries to show the values in fields that are indexed but not stored when you use the "Reconstruct & Edit" button from the "Documents" tab. If I recall right, stop words do not show up in the "Reconstruct & Edit" display -- you see things like "null_1", "null_2", etc.

Upvotes: 1

Brandon

Reputation: 10028

I don't know much about Luke, but I have worked with Lucene a lot. To see what is indexed may be tricky, even with Luke, because you can only see the data for stored fields.

For the last Lucene project I did (Solr actually), I had virtually every field marked as indexed but not stored. For those cases, to test if a document had the right indexed term, I would query the index for documents with the given primary key and the expected term. If it matches, then I know it indexed it with that term.

For example, to see if product 5 is in English, I would run the query productId:5 AND lang:en (in Lucene query syntax the AND operator must be uppercase).

I know this doesn't directly answer your question about how to use Luke, but this may be an alternative if Luke can't help you.

Upvotes: 3
