shashydhar

Reputation: 811

Lucene Wikipedia querying

I'm using Lucene to query a Wikipedia dump and extract the categories. I retrieve the relevant documents and, for each document, call the function below.

static List<String> getCategories(Document document) throws IOException
{
    List<String> categories = new ArrayList<String>();
    String text = document.get("text");
    WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));

    CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
    TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

    while (tf.incrementToken())
    {
        String tokText = termAtt.toString();
        if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true)
        {
            categories.add(tokText);
        }
    }

    return categories;
}

But it throws the following exception at the while statement:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.zzRefill(WikipediaTokenizerImpl.java:574)
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizerImpl.getNextToken(WikipediaTokenizerImpl.java:781)
    at org.apache.lucene.analysis.wikipedia.WikipediaTokenizer.incrementToken(WikipediaTokenizer.java:200)
    at SearchIndex.getCategories(SearchIndex.java:82)
    at SearchIndex.main(SearchIndex.java:54)

I looked at the zzRefill() function, but I'm not able to understand it. Is this a known bug? I don't know what I'm doing wrong. The Lucene developers say that the whole WikipediaTokenizer section is in beta and may be subject to change. I was hoping someone could help me.

Upvotes: 0

Views: 320

Answers (1)

shashydhar

Reputation: 811

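I solved the problem by calling tf.reset() before the while loop. The TokenStream contract requires reset() to be called before the first call to incrementToken(); without it, the tokenizer's internal reader is not yet set up, which is what zzRefill() trips over.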
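
For reference, here is a minimal sketch of the corrected function following the usual TokenStream workflow (reset, incrementToken, end, close). It assumes the same Lucene 4.x API as the question, where WikipediaTokenizer takes a Reader in its constructor; in later Lucene versions the constructor takes no Reader and the input is supplied through setReader() instead. The class name CategoryExtractor is hypothetical, and the method takes the raw text rather than a Document purely to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

// Hypothetical wrapper class, not part of the original question.
public class CategoryExtractor {

    // Collects every token that the tokenizer tags as a category.
    static List<String> getCategories(String text) throws IOException {
        List<String> categories = new ArrayList<String>();

        // Assumption: Lucene 4.x constructor that accepts a Reader.
        WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(text));
        try {
            CharTermAttribute termAtt = tf.addAttribute(CharTermAttribute.class);
            TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

            // Required before the first incrementToken() call.
            tf.reset();
            while (tf.incrementToken()) {
                if (WikipediaTokenizer.CATEGORY.equals(typeAtt.type())) {
                    categories.add(termAtt.toString());
                }
            }
            // Signals end of stream so the tokenizer can finalize its state.
            tf.end();
        } finally {
            // Releases the underlying reader even if tokenizing fails.
            tf.close();
        }
        return categories;
    }
}
```

With the original signature, the call would be getCategories(document.get("text")). Note that a multi-word category is emitted as several tokens, so each word arrives as a separate list entry.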

Upvotes: 1
