Gotz84

Reputation: 45

Problems with a custom Solr Tokenizer when adding a lemmatizer

I'm adding a text lemmatizer to Solr. I have to process the entire text at once, because context matters for lemmatization.

I found this code on the internet and modified it a bit:

http://grokbase.com/t/lucene/solr-user/138d0qn4v0/issue-with-custom-tokenizer

I added our lemmatizer and changed this line

endOffset = word.length();

to this:

endOffset = startOffset + word.length();

Now if I use the Analysis screen in the Solr Admin UI, I have no problems with either Index or Query values. I type the phrase, and when I analyse the values the result is the correctly lemmatized text.

The problems appear when I run queries from the Query section and when I index documents. Checking debugQuery I can see the following. If I search for the text "korrikan" (meaning "running") in "naiz_body", the text is correctly lemmatized:

<str name="rawquerystring">naiz_body:"korrikan"</str>
<str name="querystring">naiz_body:"korrikan"</str>
<str name="parsedquery">naiz_body:korrika</str>
<str name="parsedquery_toString">naiz_body:korrika</str>

Now if I immediately afterwards search for the text "jolasten" (meaning "playing"), the text is not lemmatized, and parsedquery and parsedquery_toString are not updated:

<str name="rawquerystring">naiz_body:"jolasten"</str>
<str name="querystring">naiz_body:"jolasten"</str>
<str name="parsedquery">naiz_body:korrika</str>
<str name="parsedquery_toString">naiz_body:korrika</str>

If I wait a bit (or if I stop Solr and start it again) and then search for "jolasten", I get the text correctly lemmatized:

<str name="rawquerystring">naiz_body:"jolasten"</str>
<str name="querystring">naiz_body:"jolasten"</str>
<str name="parsedquery">naiz_body:jolastu</str>
<str name="parsedquery_toString">naiz_body:jolastu</str>

Why?

Here is the code:

package eu.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource.AttributeFactory; // needed for the constructor signature (Lucene 4.x)

public class LemmatizerTokenizer extends Tokenizer {
    private Lemmatizer lemmatizer = new Lemmatizer();
    private List<Token> tokenList = new ArrayList<Token>();
    int tokenCounter = -1;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute position = addAttribute(PositionIncrementAttribute.class);

    // NOTE: the whole Reader is consumed and lemmatized once, here in the constructor.
    public LemmatizerTokenizer(AttributeFactory factory, Reader reader) {
        super(factory, reader);
        System.out.println("### Lemmatizer Tokenizer ###");
        String textToProcess = null;
        try {
            textToProcess = readFully(reader);
            processText(textToProcess);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Read the entire Reader into memory and return the lemmatized text.
    public String readFully(Reader reader) throws IOException {
        char[] arr = new char[8 * 1024]; // 8K at a time
        StringBuffer buf = new StringBuffer();
        int numChars;
        while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
            buf.append(arr, 0, numChars);
        }
        System.out.println("### Read Fully ### => " + buf.toString());
        return lemmatizer.getLemma(buf.toString());
    }

    // Split the lemmatized text on spaces and buffer one Token per word.
    public void processText(String textToProcess) {
        System.out.println("### Process Text ### => " + textToProcess);
        String[] wordsList = textToProcess.split(" ");
        int startOffset = 0, endOffset = 0;
        for (String word : wordsList) {
            endOffset = startOffset + word.length();
            Token aToken = new Token(word, startOffset, endOffset);
            aToken.setPositionIncrement(1);
            tokenList.add(aToken);
            startOffset = endOffset + 1;
        }
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        tokenCounter++;
        System.out.println("### Increment Token ###");
        System.out.println("Token Counter => " + tokenCounter);
        System.out.println("TokenList size => " + tokenList.size());
        if (tokenCounter < tokenList.size()) {
            Token aToken = tokenList.get(tokenCounter);
            System.out.println("Increment Token => " + aToken.toString());
            termAtt.append(aToken);
            termAtt.setLength(aToken.length());
            offsetAttribute.setOffset(correctOffset(aToken.startOffset()),
                                      correctOffset(aToken.endOffset()));
            position.setPositionIncrement(aToken.getPositionIncrement());
            return true;
        }
        return false;
    }

    @Override
    public void close() throws IOException {
        System.out.println("### Close ###");
        super.close();
    }

    @Override
    public void end() throws IOException {
        // setting final offset
        System.out.println("### End ###");
        super.end();
    }

    @Override
    public void reset() throws IOException {
        System.out.println("### Reset ###");
        tokenCounter = -1;
        super.reset();
    }
}
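
For reference, the tokenizer is registered through a TokenizerFactory. Mine is essentially the standard boilerplate below (the class name is illustrative, assuming Lucene/Solr 4.x, where create() receives the AttributeFactory and the Reader):

package eu.solr.analysis;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class LemmatizerTokenizerFactory extends TokenizerFactory {

    public LemmatizerTokenizerFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader input) {
        // Solr creates one Tokenizer per thread and then reuses it across requests.
        return new LemmatizerTokenizer(factory, input);
    }
}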

Thank you all!

edit:

In answer to @alexandre-rafalovitch: the Analysis screen in the Admin UI works well. If I query or index text there, the text is correctly lemmatized. The problem is in the Query UI: the first query calls the lemmatizer, but the second one seems to reuse the previously lemmatized, buffered text and calls incrementToken directly. See the code output when I make these queries. In the Analysis UI, if I query for "Korrikan" and then for "Jolasten", it outputs this:

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => korrikan
### Eustagger OUT ### => korrika  
### Process Text ### => korrika  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => Jolasten
### Eustagger OUT ### => jolastu  
### Process Text ### => jolastu  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => jolastu
### Increment Token ###
Token Counter => 1
TokenList size => 1

If I make the same queries in the Query UI, it outputs this:

## BasqueLemmatizerTokenizer create
### BasqueLemmatizer Tokenizer ###
### Read Fully ### => korrikan
### Eustagger OUT ### => korrika  
### Process Text ### => korrika  
### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1
### End ###
### Close ###

### Reset ###
### Increment Token ###
Token Counter => 0
TokenList size => 1
Increment Token => korrika
### Increment Token ###
Token Counter => 1
TokenList size => 1
### End ###
### Close ###

For the second query it doesn't create a new tokenizer; it looks like Solr resets the existing one, but it reads the old text.

I wrote to the code owner and he suggested I look at TrieTokenizer.
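
If I understand that hint correctly, TrieTokenizer does its work in reset() rather than in the constructor. An untested sketch of that restructuring for the tokenizer above (assuming Lucene 4.x, reusing the fields and helpers already shown, with the constructor no longer touching the Reader) would be:

@Override
public void reset() throws IOException {
    super.reset();                 // the base class swaps in the fresh input Reader
    tokenList.clear();             // drop the tokens buffered for the previous text
    tokenCounter = -1;
    processText(readFully(input)); // re-read and re-lemmatize the new input
}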

Upvotes: 2

Views: 742

Answers (2)

Gotz84

Reputation: 45

Finally I got it working!

First I tried modifying the PatternTokenizer, and in the end I adapted the StandardTokenizer to use the lemmatizer. In brief, I lemmatize the string read from the input, then create a StringReader with the lemmatized text and hand it to the scanner.

Here is the code; I hope it can be useful for somebody (it modifies the StandardTokenizer source):

...

// Read the entire input Reader into memory and return its lemmatized form.
public String processReader(Reader reader) throws IOException {
    char[] arr = new char[8 * 1024]; // 8K at a time
    StringBuffer buf = new StringBuffer();
    int numChars;
    while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
        buf.append(arr, 0, numChars);
    }
    return lemmatizer.getLemma(buf.toString());
}

...

@Override
public void reset() throws IOException {
    super.reset(); // let the base Tokenizer make the new input Reader available first
    // Feed the scanner a fresh Reader over the lemmatized text on every reset.
    scanner.yyreset(new StringReader(processReader(input)));
}
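
For completeness, the field type in schema.xml then just references the factory that creates this tokenizer (the names below are placeholders for my setup):

<fieldType name="text_eu_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="eu.solr.analysis.LemmatizerTokenizerFactory"/>
  </analyzer>
</fieldType>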

Upvotes: 1

Alexandre Rafalovitch

Reputation: 9789

Are you sure the problem is the lemmatizer? You can check by putting your text into the Analysis screen in the Admin UI. Enter the text and see what the analyzer chain does with it.

However, the following part:

If I wait for a bit (or if I stop solr and I run it) and I ask for "jolasten" text I get the text well lemmatized

makes me think that maybe you are just forgetting to commit your indexed text. The delay before the content shows up would then be explained by a soft commit with the interval configured in your solrconfig.xml.
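
For illustration, such an interval would come from an autoSoftCommit block like this in solrconfig.xml (the 10-second value is just an example):

<autoSoftCommit>
  <!-- newly indexed documents become searchable within ~10 seconds -->
  <maxTime>10000</maxTime>
</autoSoftCommit>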

Upvotes: 0
