user2773013

Reputation: 3170

Problem chaining tokenizer with filters with PythonAnalyzer in PyLucene

I'm new to PyLucene. I managed to install it on Ubuntu and looked at [sample code][1] showing how a custom analyzer is implemented. I tried modifying it by adding an NGramTokenFilter, but I keep getting an error when printing out the result of the custom analyzer. If I remove the ngram filter, it works just fine.

Basically, what I'm trying to do is split all incoming text on whitespace, lowercase it, fold the characters to ASCII, and generate ngrams.

The code is as follows:

class myAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        filter = NGramTokenFilter(filter, 1, 2)

        return self.TokenStreamComponents(source, filter)

    def initReader(self, fieldName, reader):
        return reader


analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)

The error I keep getting is:

InvalidArgsError: (<class 'org.apache.lucene.analysis.ngram.NGramTokenFilter'>, '__init__', (<ASCIIFoldingFilter: ASCIIFoldingFilter@192d74fb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1>, 2, 3))

What am I doing wrong?

Upvotes: 0

Views: 72

Answers (1)

andrewJames

Reputation: 22042

Looking at the JavaDoc for NGramTokenFilter, you have to use this:

NGramTokenFilter(filter, size)

for a fixed ngram size; or this:

NGramTokenFilter(filter, min, max, boolean) 

for a range of ngram sizes with a boolean for preserveOriginal, which determines:

Whether or not to keep the original term when it is shorter than minGram or longer than maxGram

What you have is neither of those.
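
For example, either of these is a valid call (a minimal sketch, assuming filter is the ASCIIFoldingFilter from your code):

# fixed gram size: every gram is exactly 2 characters long
result = NGramTokenFilter(filter, 2)

# gram sizes from 1 to 2; True preserves the original term when its
# length falls outside the min/max range
result = NGramTokenFilter(filter, 1, 2, True)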

(Side note: I'm not sure an ngram of size 1 makes a lot of sense - but maybe it's what you need.)


Update

Just for completeness, here is my (somewhat modified) standalone version of the code in the question. The most relevant part is this line:

result = NGramTokenFilter(filter, 1, 2, True)

The program (using PyLucene 9.4.1 and Java 11):

import lucene
from org.apache.pylucene.analysis import PythonAnalyzer
from org.apache.lucene.analysis import Analyzer
from java.io import StringReader
from org.apache.lucene.analysis.core import WhitespaceTokenizer, LowerCaseFilter
from org.apache.lucene.analysis.miscellaneous import ASCIIFoldingFilter
from org.apache.lucene.analysis.ngram import NGramTokenFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

class myAnalyzer(PythonAnalyzer):

    def __init__(self):
        PythonAnalyzer.__init__(self)

    def createComponents(self, fieldName):
        # Tokenize on whitespace, then lowercase, fold to ASCII, and ngram.
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        result = NGramTokenFilter(filter, 1, 2, True)

        return Analyzer.TokenStreamComponents(source, result)

    def initReader(self, fieldName, reader):
        return reader


lucene.initVM(vmargs=['-Djava.awt.headless=true'])  # start the JVM before using the Lucene classes
analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)

The output:

['m', 'ma', 'a', 'ar', 'r', 'rg', 'g', 'gi', 'i', 'in', 'n', 'margin', 'w', 'wo', 'o', 'on', 'n', 'nd', 'd', 'de', 'e', 'er', 'r', 'rf', 'f', 'fu', 'u', 'ul', 'l', 'le', 'e', 'wonderfule']
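
Note that the full terms 'margin' and 'wonderfule' appear in the output only because preserveOriginal is True: both are longer than maxGram (2), so they would otherwise be dropped. You can also see that the ASCIIFoldingFilter has folded the ê in 'wondêrfule' to a plain e.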

Upvotes: 1
