Reputation: 189
I'm working on a search engine built on Wikipedia dumps. I've split and parsed the dumps and extracted clean text from the articles, and the next step is to build an index. I chose PyLucene for that task. My question is: should I index each article as a whole (the entire Wikipedia page), or section by section (each section contains roughly 2 to 4 paragraphs)? I don't want to lose any information, and I want to retrieve the specific paragraph that contains the answer to each question asked in the search engine.
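To make the two options concrete, here is a minimal sketch in plain Python (deliberately not PyLucene code) of what would be handed to the indexer in each case. The names `IndexUnit`, `article_level_units` and `section_level_units` are hypothetical and exist only for illustration, and the sample text is a placeholder. The assumption is that each unit would become one document in the index.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IndexUnit:
    """One record to hand to the indexer (hypothetical structure)."""
    article_title: str
    section_title: str
    text: str


def article_level_units(title: str, sections: List[Tuple[str, str]]) -> List[IndexUnit]:
    """Option 1: a single unit per article, with all section text joined."""
    body = "\n\n".join(text for _, text in sections)
    return [IndexUnit(title, "", body)]


def section_level_units(title: str, sections: List[Tuple[str, str]]) -> List[IndexUnit]:
    """Option 2: one unit per section, each keeping its parent article title."""
    return [IndexUnit(title, sec_title, text) for sec_title, text in sections]


# Placeholder article: (section title, section text) pairs.
sections = [
    ("Introduction", "Text of the introduction."),
    ("History", "Text of the history section."),
]

whole = article_level_units("Example article", sections)
parts = section_level_units("Example article", sections)
print(len(whole), len(parts))
```

With section-level units, a hit already points at a 2 to 4 paragraph span, and the stored article title keeps the link back to the full page, so no text is dropped in either layout.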
Upvotes: 1
Views: 93