Reputation: 3854
I am trying to debug indexing documents in Lucene, and I need to see the contents of the index so I can see exactly how the documents got indexed. Allegedly Luke does this, but there is no documentation for it whatsoever, and when I point it at the index directory (at any of them, although I don't know why it can't figure out which one is right on its own), I get nothing. Surely there is some simple way to do this?
Upvotes: 21
Views: 31894
Reputation: 33341
Luke IS the simple way to do it. You run it, browse to the index, and are off to the races. Couldn't be easier.
There are other tools out there; LIMO is also a nice one for this, but it is harder to get started with than Luke.
Perhaps if you give some details on the problem you are running into with Luke, you will be able to get some help with that.
Upvotes: 15
Reputation: 93
It is possible to compile Luke from source after adding the Elasticsearch postings formats to Luke's META-INF/services.
Just follow this approach.
The same approach can also be followed to test custom postings formats/codecs with Lucene.
ElasticSearch uses a custom postings format (the postings format defines how the inverted index is represented in memory / on disk), and Luke doesn’t know about it. To tell Luke about the ES postings format, add the SPI class by following the steps below.
2. Add a dependency on your required version of ElasticSearch to the Luke project’s pom file:
<!-- ElasticSearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.1.1</version>
</dependency>
3. Compile the Luke jar file (creates target/luke-with-deps.jar):
$ mvn package
4. Unpack Luke’s list of known postings formats to a temporary file:
$ unzip target/luke-with-deps.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat -d ./tmp/
Archive: target/luke-with-deps.jar
inflating: ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
5. Add the ElasticSearch postings formats to the temp file (note the >> redirection, which appends each name to the file):
$ echo "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat" >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
6. Repack the modified file back into the jar:
$ jar -uf target/luke-with-deps.jar -C tmp/ META-INF/services/org.apache.lucene.codecs.PostingsFormat
7. Run Luke:
$ ./luke.sh
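If Luke still cannot open the index after these steps, one thing worth checking is whether the repacked jar really lists the added postings formats. The following is a minimal sketch using only the Java standard library; the class name `ServiceFileCheck` and the method `listProviders` are hypothetical names made up for this illustration, and the default jar path is simply the one produced by the build step above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.zip.ZipEntry;

// Hypothetical helper: prints the provider class names registered in a jar's
// META-INF/services file for Lucene postings formats.
class ServiceFileCheck {
    static final String ENTRY =
            "META-INF/services/org.apache.lucene.codecs.PostingsFormat";

    // Returns the non-empty, non-comment lines of the given entry,
    // or an empty list if the jar has no such entry.
    static List<String> listProviders(String jarPath, String entryName) throws IOException {
        List<String> providers = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            ZipEntry entry = jar.getEntry(entryName);
            if (entry == null) {
                return providers;
            }
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Service files allow '#' comments; strip them before trimming.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        providers.add(line);
                    }
                }
            }
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the jar path created by "mvn package" above.
        String jarPath = args.length > 0 ? args[0] : "target/luke-with-deps.jar";
        for (String name : listProviders(jarPath, ENTRY)) {
            System.out.println(name);
        }
    }
}
```

If the three Elasticsearch class names do not appear in the output, the append or repack step most likely did not take effect.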
Upvotes: 1
Reputation: 5693
Luke tries to show the values in fields that are indexed but not stored when you use the "Reconstruct & Edit" button from the "Documents" tab. If I recall right, stop words do not show up in the "Reconstruct & Edit" display -- you see things like "null_1", "null_2", etc.
Upvotes: 1
Reputation: 10028
I don't know much about Luke, but I have worked with Lucene a lot. To see what is indexed may be tricky, even with Luke, because you can only see the data for stored fields.
For the last Lucene project I did (Solr actually), I had virtually every field marked as indexed but not stored. For those cases, to test if a document had the right indexed term, I would query the index for documents with the given primary key and the expected term. If it matches, then I know it indexed it with that term.
For example, to see if product 5 is in English, I would query for productId:5 AND lang:en (the AND has to be uppercase in Lucene's query syntax).
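As a rough sketch of how such a check query could be put together programmatically: the helper below builds a query string in which every field/value pair is required (the `+` prefix in Lucene's classic query syntax, equivalent to joining the clauses with AND). The class name `IndexedTermQuery` and the method `requireAll` are hypothetical, and the sketch assumes each value analyzes to a single term in its field; it only produces the string and does not touch an index.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: composes a "all of these terms must match" query string.
class IndexedTermQuery {
    // Each pair becomes +field:"value"; backslashes and double quotes
    // inside the value are escaped so the quoted value stays intact.
    static String requireAll(Map<String, String> fieldValues) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, String> e : fieldValues.entrySet()) {
            String value = e.getValue()
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"");
            query.add("+" + e.getKey() + ":\"" + value + "\"");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Field names taken from the example above.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("productId", "5");
        expected.put("lang", "en");
        System.out.println(requireAll(expected));
    }
}
```

The resulting string can be handed to a query parser, or pasted into Luke's search tab; if the document comes back, it was indexed with that term.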
I know this doesn't directly answer your question about how to use Luke, but this may be an alternative if Luke can't help you.
Upvotes: 3