Reputation: 2257
Is it possible to use Lucene Benchmark to index a Wikipedia dump? I want to be able to execute phrase queries against the latest English Wikipedia page dump. I've been looking for example use cases, but I haven't found any.
I downloaded the latest English dump, named enwiki-latest-pages-articles.xml.bz2
Then I ran the command in the terminal: java org.apache.lucene.benchmark.utils.ExtractWikipedia -i ~/enwiki-latest-pages-articles.xml.bz2
which I believe extracted the pages into a directory labeled "enwiki".
Now, is there something else in the benchmark module that I need to run in order to index the wiki? The README.enwiki file doesn't really give a clear set of instructions; in fact, I'm not even sure whether I was supposed to run the ExtractWikipedia class at all.
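To make the goal concrete, this is roughly what I'd like to be able to run against the resulting index. It's just a sketch against a recent Lucene API (5.x or later); the "index" path and the "body"/"docname" field names are guesses on my part, since I don't know what the benchmark actually writes:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class WikiPhraseSearch {
        public static void main(String[] args) throws Exception {
            // Open the index the benchmark run produced ("index" is a guessed path).
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Match the exact phrase "category theory" in the body field.
                PhraseQuery query = new PhraseQuery.Builder()
                        .add(new Term("body", "category"))
                        .add(new Term("body", "theory"))
                        .build();
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(searcher.doc(sd.doc).get("docname") + "  " + sd.score);
                }
            }
        }
    }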
Upvotes: 1
Views: 3280
Reputation: 3609
The Wikimedia Foundation has been working on a new project called DiffDb. Using Hadoop, we create the diff between two revisions, and all of those diffs are indexed using Lucene. You can find the code on GitHub.
The resulting index for just the English Wikipedia is 1.4 TB, but it lets you run some really interesting queries, such as who added "foo" in April 2005, who removed more than 10 KB in a single edit, and so on.
Upvotes: 0
Reputation: 7054
Just run "ant". I posted a more thorough answer on the Lucene mailing list, but that's basically the gist of it: the build.xml file has a number of targets for running the benchmarks.
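To expand on that a little: the usual entry point is the benchmark module's ant targets, something like ant run-task -Dtask.alg=conf/<some .alg file>, driven by an algorithm (.alg) file that points at the dump. The exact target and conf file names here are from memory, so check the module's README and conf/ directory. Alternatively, if you want to index the files that ExtractWikipedia already wrote without going through an .alg file, a minimal do-it-yourself sketch (same assumptions as the snippet in the question: a recent Lucene API, one extracted page per file, and made-up "enwiki"/"index" paths and field names) would be:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexExtractedWiki {
        public static void main(String[] args) throws IOException {
            try (IndexWriter writer = new IndexWriter(
                         FSDirectory.open(Paths.get("index")),
                         new IndexWriterConfig(new StandardAnalyzer()));
                 // Walk the directory that ExtractWikipedia wrote, one page per file.
                 Stream<Path> pages = Files.walk(Paths.get("enwiki"))) {
                pages.filter(Files::isRegularFile).forEach(page -> {
                    try {
                        Document doc = new Document();
                        // Keep the file path so hits can be traced back to a page.
                        doc.add(new StringField("docname", page.toString(), Store.YES));
                        // TextField is tokenized and indexes positions by default,
                        // which is what phrase queries need.
                        doc.add(new TextField("body",
                                new String(Files.readAllBytes(page), StandardCharsets.UTF_8),
                                Store.NO));
                        writer.addDocument(doc);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }

After that, a phrase query like the one sketched in the question should work against the "index" directory. Keep in mind that an index built through the ant targets may use different field names (the benchmark's DocMaker decides those), so check your .alg and conf files before querying an index built that way.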
Upvotes: 1