Asma

Reputation: 189

Indexing Wikipedia dumps without losing information

I'm working on a search engine built on Wikipedia dumps. I've split and parsed the dumps and extracted clean text from the articles, and the next step is to build an index. I chose PyLucene for that task. My question is: should I index each article as a whole (the entire Wikipedia page), or section by section (each section contains roughly 2 to 4 paragraphs)? I don't want to lose any information, and I want the search engine to return the specific paragraph that contains the answer to each question asked.
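To make the section-by-section option concrete, here is a rough sketch of what I have in mind, in plain Python with no PyLucene calls. It assumes the cleaned text still marks headings in the `== Heading ==` style; the function name `split_into_section_docs`, the field names, and the "Introduction" label for the lead text are just placeholders I made up for illustration.

```python
import re
from typing import Dict, List

# Assumption: section headings survive cleaning as "== Heading ==" lines.
HEADING_RE = re.compile(r"^==+[ \t]*(.+?)[ \t]*==+[ \t]*$", re.MULTILINE)


def split_into_section_docs(title: str, text: str) -> List[Dict[str, object]]:
    """Return one record per non-empty section, each tagged with its parent article."""
    docs: List[Dict[str, object]] = []

    def add(heading: str, body: str) -> None:
        body = body.strip()
        if body:  # skip sections that contain no text
            docs.append({
                "article_title": title,       # lets results be grouped back by article
                "section_heading": heading,
                "section_index": len(docs),   # position among the kept sections
                "text": body,
            })

    heading = "Introduction"  # placeholder label for text before the first heading
    start = 0
    for match in HEADING_RE.finditer(text):
        add(heading, text[start:match.start()])
        heading = match.group(1)
        start = match.end()
    add(heading, text[start:])  # text after the last heading
    return docs


if __name__ == "__main__":
    sample = (
        "Paris is the capital of France.\n\n"
        "== History ==\nFounded long ago.\n\n"
        "== Geography ==\nOn the Seine.\n"
    )
    for doc in split_into_section_docs("Paris", sample):
        print(doc["section_index"], doc["section_heading"], "->", doc["text"])
```

The idea would be that each record becomes one document in the index, and because every record keeps its `article_title`, the sections of an article could still be regrouped at query time. Whether that is enough to avoid losing information is exactly what I'm unsure about.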

Upvotes: 1

Views: 93

Answers (0)
