Reputation: 4151
I am using the German language analyzer to tokenize some content. I know that it is basically a macro filter for "lowercase", "german_stop", "german_keywords", "german_normalization" and "german_stemmer".
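For reference, here is roughly how I understand that chain at the Lucene level (a sketch only, not my actual configuration; class names are from a Lucene 5.x-era analysis-common, and the empty keyword set is just a placeholder):

public final class GermanPipelineSketch extends org.apache.lucene.analysis.Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    org.apache.lucene.analysis.Tokenizer source =
        new org.apache.lucene.analysis.standard.StandardTokenizer();                       // "standard" tokenizer
    org.apache.lucene.analysis.TokenStream result =
        new org.apache.lucene.analysis.core.LowerCaseFilter(source);                       // "lowercase"
    result = new org.apache.lucene.analysis.core.StopFilter(result,
        org.apache.lucene.analysis.de.GermanAnalyzer.getDefaultStopSet());                 // "german_stop"
    result = new org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter(result,
        org.apache.lucene.analysis.util.CharArraySet.EMPTY_SET);                           // "german_keywords" (placeholder set)
    result = new org.apache.lucene.analysis.de.GermanNormalizationFilter(result);          // "german_normalization"
    result = new org.apache.lucene.analysis.de.GermanLightStemFilter(result);              // "german_stemmer" (light variant)
    return new TokenStreamComponents(source, result);
  }
}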
My problem has to do with the normalization filter. Here is the Elasticsearch documentation and the Lucene implementation of the filter. The problem is that ae, oe and ue are treated as the German letters ä, ö and ü and therefore transformed to a, o and u.
The second transformation is good, but the first causes more problems than it solves. There is usually no ae, oe or ue in German texts that really represents ä, ö or ü. Most of the time they appear inside foreign words derived from Latin or English, like 'Aerodynamik' (aerodynamics). The filter then interprets the 'ae' as 'ä' and transforms it to 'a'. This yields 'arodynamik' as the token. Normally this is not a problem, since the search term is normalized with the same filter. It does become a problem when combined with wildcard search, though:
Imagine a word like 'FooEdit': it will be tokenized to 'foodit'. A search for 'edit OR *edit*' (which is my normal search when the user searches for 'edit') will not return a result, since the 'e' of 'edit' got lost. Because my content has a lot of words like that and people search for partial words, it's not as much of an edge case as it seems.
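Here is a small reproduction of what I mean (a sketch assuming Lucene's analysis-common on the classpath, 5.x-era constructors; only the lowercase and normalization steps are wired up, which is enough to show the effect):

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class NormalizationRepro {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("FooEdit Aerodynamik Fähre"));
    // Only lowercase + normalization, which is where the 'e' gets lost.
    TokenStream stream = new GermanNormalizationFilter(new LowerCaseFilter(tokenizer));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term);   // prints: foodit, arodynamik, fahre
    }
    stream.end();
    stream.close();
  }
}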
So my question is: is there any way to get rid of the 'ae -> a' transformation? My understanding is that this is part of the German2 snowball algorithm, so it probably can't be changed. Does that mean I would have to drop the whole normalization step, or can I provide my own version of the snowball algorithm where I just strip out the parts I don't like (I didn't find any documentation on how to use a custom snowball algorithm for normalization)?
Cheers
Tom
Upvotes: 0
Views: 2038
Reputation: 33341
This transformation is handled by the GermanNormalizationFilter, rather than the stemmer. It's really not that difficult a class to understand (unlike many stemmers), and if I understand correctly, it looks like a one-line change will get you what you want:
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.StemmerUtil;

public final class CustomGermanNormalizationFilter extends TokenFilter {
  // FSM with 3 states:
  private static final int N = 0; /* ordinary state */
  private static final int V = 1; /* stops 'u' from entering umlaut state */
  private static final int U = 2; /* umlaut state, allows e-deletion */

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public CustomGermanNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      int state = N;
      char buffer[] = termAtt.buffer();
      int length = termAtt.length();
      for (int i = 0; i < length; i++) {
        final char c = buffer[i];
        switch (c) {
          // Removing this case should prevent e-deletion for "ae"
          // case 'a':
          case 'o':
            state = U;
            break;
          case 'u':
            state = (state == N) ? U : V;
            break;
          case 'e':
            // In the umlaut state, the 'e' of "oe"/"ue" (and, in the original
            // filter, "ae") is deleted from the term.
            if (state == U)
              length = StemmerUtil.delete(buffer, i--, length);
            state = V;
            break;
          case 'i':
          case 'q':
          case 'y':
            state = V;
            break;
          case 'ä':
            buffer[i] = 'a';
            state = V;
            break;
          case 'ö':
            buffer[i] = 'o';
            state = V;
            break;
          case 'ü':
            buffer[i] = 'u';
            state = V;
            break;
          case 'ß':
            // Expand 'ß' to "ss": write the first 's', grow the buffer,
            // shift the tail right by one and write the second 's'.
            buffer[i++] = 's';
            buffer = termAtt.resizeBuffer(1 + length);
            if (i < length)
              System.arraycopy(buffer, i, buffer, i + 1, (length - i));
            buffer[i] = 's';
            length++;
            state = N;
            break;
          default:
            state = N;
        }
      }
      termAtt.setLength(length);
      return true;
    } else {
      return false;
    }
  }
}
Using that in place of german_normalization should do the trick.
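To reference it by name from an analysis configuration you'd also need a factory; a minimal sketch, assuming Lucene's analysis-common and a made-up class name (wiring it into Elasticsearch additionally requires packaging it as an analysis plugin, which is version-specific):

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

// Hypothetical factory that exposes the custom filter to the analysis framework.
public class CustomGermanNormalizationFilterFactory extends TokenFilterFactory {
  public CustomGermanNormalizationFilterFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new CustomGermanNormalizationFilter(input);
  }
}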
Upvotes: 1
Reputation: 1234
As you said, the German analyzer is a pipeline combining the steps you listed (documentation).
In theory, you could specify your own analyzer just like that and replace the german_normalization filter with another one, for example a Pattern Replace Token Filter. I have never used it, but I'd guess the syntax is the same as for the Char Replace Token Filter (link). See the sketch below for the general idea.
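At the Lucene level the idea could look roughly like this (untested sketch; PatternReplaceFilter is the Lucene filter behind pattern replacement, and the replacements below only fold the umlauts and ß, leaving 'ae'/'oe'/'ue' alone):

import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

public final class UmlautFoldingSketch {
  // Wraps an existing token stream so that only ä/ö/ü/ß are rewritten,
  // instead of applying the full GermanNormalizationFilter behaviour.
  public static TokenStream wrap(TokenStream in) {
    TokenStream result = new PatternReplaceFilter(in, Pattern.compile("ä"), "a", true);
    result = new PatternReplaceFilter(result, Pattern.compile("ö"), "o", true);
    result = new PatternReplaceFilter(result, Pattern.compile("ü"), "u", true);
    result = new PatternReplaceFilter(result, Pattern.compile("ß"), "ss", true);
    return result;
  }
}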
Upvotes: 1