Reputation: 811
I'm using Lucene to query a Wikipedia dump and extract the categories. I retrieve the relevant documents and, for every document, call the function below.
static List<String> getCategories(Document document) throws IOException
{
    List<String> categories = new ArrayList<String>();
    String text = document.get("text");
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
    while (tf.incrementToken())
    {
        String tokText = termAtt.toString();
        if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY))
        {
            categories.add(tokText);
        }
    }
    return categories;
}
However, it throws the following exception at the while statement:
Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
    at SearchIndex.getCategories(SearchIndex.java:82)
    at SearchIndex.main(SearchIndex.java:54)
I looked at the zzRefill() function, but I'm not able to understand it. Is this a known bug? I don't know what I'm doing wrong. The Lucene developers say that the whole WikipediaTokenizer section is in beta and may be subject to change. I was hoping someone could help me.
Upvotes: 0
Views: 320
Reputation: 811
I solved the problem by calling tf.reset() before the while loop. The TokenStream contract requires reset() to be called before the first call to incrementToken().
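For anyone hitting the same exception, here is a minimal sketch of the full consume cycle (reset, incrementToken, end, close). It assumes the same Lucene 4.x API as the question, where WikipediaTokenizer takes a Reader in its constructor; in later versions I believe the tokenizer is created without arguments and given its input through setReader(), so adjust accordingly. The class name CategoryExtractor is just a placeholder of mine, and the method takes the text directly instead of a Document to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Hypothetical helper class; the name is not from the original post.
public class CategoryExtractor {

    // Assumes Lucene 4.x, where the tokenizer is constructed with a Reader.
    static List<String> getCategories(String text) throws IOException {
        List<String> categories = new ArrayList<String>();
        // try-with-resources closes the tokenizer even if tokenizing fails.
        try (WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text))) {
            CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
            TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

            tf.reset(); // required before the first incrementToken()
            while (tf.incrementToken()) {
                // Keep only tokens the tokenizer marked as category tokens.
                if (WikipediaTokenizer.CATEGORY.equals(typeAtt.type())) {
                    categories.add(termAtt.toString());
                }
            }
            tf.end(); // signals end of stream before close()
        }
        return categories;
    }
}
```

In the original function you would call it as getCategories(document.get("text")). Note that document.get("text") returns null if the field was not stored, so it may be worth checking for that before tokenizing.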
Upvotes: 1