Reputation: 2257
Is it possible to use Lucene Benchmark to index a Wikipedia dump? I want to be able to execute phrase queries against the latest English Wikipedia page dump. I've been looking for example use cases, but I haven't found any.
I downloaded the latest English dump, named enwiki-latest-pages-articles.xml.bz2
Then I ran the command in the terminal: java org.apache.lucene.benchmark.utils.ExtractWikipedia -i ~/enwiki-latest-pages-articles.xml.bz2
which I believe extracted the pages into a directory labeled "enwiki".
Now, is there something else in the benchmark module that I need to run in order to index the wiki? The README.enwiki file doesn't really give a clear set of instructions; in fact, I'm not even sure whether I was supposed to run the ExtractWikipedia class at all.
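To make the goal concrete, this is roughly what I'd like to be able to run against the resulting index. It's just a sketch against a recent Lucene API (5.x or later); the "index" path and the "body"/"docname" field names are guesses on my part, since I don't know what the benchmark actually writes:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class WikiPhraseSearch {
        public static void main(String[] args) throws Exception {
            // Open the index the benchmark run produced ("index" is a guessed path).
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Match the exact phrase "category theory" in the body field.
                PhraseQuery query = new PhraseQuery.Builder()
                        .add(new Term("body", "category"))
                        .add(new Term("body", "theory"))
                        .build();
                TopDocs hits = searcher.search(query, 10);
                for (ScoreDoc sd : hits.scoreDocs) {
                    System.out.println(searcher.doc(sd.doc).get("docname") + "  " + sd.score);
                }
            }
        }
    }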
Upvotes: 1
Views: 3280
Reputation: 3609
The Wikimedia Foundation has been working on a new project called DiffDb. Using Hadoop, we create the diff between two revisions, and all of those diffs are indexed using Lucene. You can find the code on GitHub.
The resulting index for just the English Wikipedia is 1.4 TB, but it lets you run some really interesting queries, such as who added "foo" in April 2005, who removed more than 10 KB in a single edit, and so on.
Upvotes: 0
Reputation: 7054
Just run "ant". I posted a more thorough answer on the Lucene mailing list, but that's basically the gist of it: the build.xml file has a number of targets for running the benchmarks.
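To expand on that a little: the usual entry point is the benchmark module's ant targets, something like ant run-task -Dtask.alg=conf/<some .alg file>, driven by an algorithm (.alg) file that points at the dump. The exact target and conf file names here are from memory, so check the module's README and conf/ directory. Alternatively, if you want to index the files that ExtractWikipedia already wrote without going through an .alg file, a minimal do-it-yourself sketch (same assumptions as the snippet in the question: a recent Lucene API, one extracted page per file, and made-up "enwiki"/"index" paths and field names) would be:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexExtractedWiki {
        public static void main(String[] args) throws IOException {
            try (IndexWriter writer = new IndexWriter(
                         FSDirectory.open(Paths.get("index")),
                         new IndexWriterConfig(new StandardAnalyzer()));
                 // Walk the directory that ExtractWikipedia wrote, one page per file.
                 Stream<Path> pages = Files.walk(Paths.get("enwiki"))) {
                pages.filter(Files::isRegularFile).forEach(page -> {
                    try {
                        Document doc = new Document();
                        // Keep the file path so hits can be traced back to a page.
                        doc.add(new StringField("docname", page.toString(), Store.YES));
                        // TextField is tokenized and indexes positions by default,
                        // which is what phrase queries need.
                        doc.add(new TextField("body",
                                new String(Files.readAllBytes(page), StandardCharsets.UTF_8),
                                Store.NO));
                        writer.addDocument(doc);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }

After that, a phrase query like the one sketched in the question should work against the "index" directory. Keep in mind that an index built through the ant targets may use different field names (the benchmark's DocMaker decides those), so check your .alg and conf files before querying an index built that way.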
Upvotes: 1