Reputation: 16081
I want to use the Lucene API to extract n-grams from sentences, but I seem to be running into a peculiar problem. The Javadoc lists a class called NGramTokenizer, yet after downloading both the 3.6.1 and 4.0 APIs I can find no trace of it. For example, when I try the following, I get an error stating that the symbol NGramTokenizer cannot be found:
NGramTokenizer myTokenizer;
According to the documentation, NGramTokenizer should be at org.apache.lucene.analysis.NGramTokenizer, but I do not see it anywhere on my machine. A download or other miscellaneous error seems unlikely, since the same thing happens with both the 3.6.1 and 4.0 APIs.
Upvotes: 0
Views: 1661
Reputation: 1660
Here is a utility method I usually use, in case someone needs help with this. It should work with Lucene 4.10 (I haven't tested lower or higher versions). It uses Lucene's ShingleFilter to build word n-grams:
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private Set<String> generateNgrams(String sentence, int ngramCount) {
    StringReader reader = new StringReader(sentence);
    Set<String> ngrams = new HashSet<>();

    // Tokenize the sentence with the standard tokenizer/filter chain.
    StandardTokenizer source = new StandardTokenizer(reader);
    TokenStream tokenStream = new StandardFilter(source);

    // If only unigrams are needed, the standard filter is enough;
    // otherwise wrap the stream in a ShingleFilter to emit word
    // n-grams up to ngramCount tokens long.
    TokenStream sf;
    if (ngramCount == 1) {
        sf = new StandardFilter(tokenStream);
    } else {
        ShingleFilter shingles = new ShingleFilter(tokenStream);
        shingles.setMaxShingleSize(ngramCount);
        sf = shingles;
    }

    CharTermAttribute charTermAttribute = sf.addAttribute(CharTermAttribute.class);
    try {
        sf.reset();
        while (sf.incrementToken()) {
            ngrams.add(charTermAttribute.toString().toLowerCase());
        }
        sf.end();
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        try {
            sf.close();
        } catch (IOException ignored) {
        }
    }
    return ngrams;
}
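A quick usage sketch (assuming the method above is in scope and the Lucene jars are on the classpath). Note that ShingleFilter outputs the unigrams as well by default, so the set contains both single words and the shingles:

```java
Set<String> grams = generateNgrams("please divide this sentence", 2);
// The set should contain unigrams and bigrams such as:
// "please", "divide", "please divide", "divide this", "this sentence"
System.out.println(grams);
```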
The Maven dependencies required for Lucene are:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>4.10.3</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>4.10.3</version>
</dependency>
Upvotes: 0
Reputation: 4310
You are using the wrong jar. The class ships in the analyzers jar, not lucene-core, and it lives in the ngram subpackage:
lucene-analyzers-3.6.1.jar
org.apache.lucene.analysis.ngram.NGramTokenizer
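Once that jar is on the classpath, a minimal sketch of using the class (assuming the Lucene 3.6.1 API, where the constructor takes a Reader plus min/max gram sizes; note that NGramTokenizer emits character n-grams, not word n-grams):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramDemo {
    public static void main(String[] args) throws IOException {
        // Character bigrams of "lucene": lu, uc, ce, en, ne
        NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("lucene"), 2, 2);
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```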
Upvotes: 3