user3327940

Reputation: 11

Choosing the right analyzer for French AND programming language names

I'm working on a search engine (Java J2EE, Hibernate, Hibernate Search and Lucene) that analyzes the content of documents. The whole system works, but one problem remains: the choice of the analyzer.

My documents are in French, and that matters because the searches are in French too. But my search engine should also be able to find programming language names such as Java, COBOL, C, C#, C++, ...

I'm currently using Lucene's French analyzer, and the problem is what it produces for the query terms "C", "C++" and "C#". I would like to have ["C" or "C++" or "C#"] => "C", but I get => "".

I'm new to these technologies and would like to know which analyzer I should use, or whether I have to implement a custom one.

(I'm using Hibernate Search 3.0.0.GA, which is VERY old, and I can't change the version.)

Thanks

Upvotes: 0

Views: 955

Answers (2)

user3327940

Reputation: 11

I'm coming back because I'm not satisfied with my solution (it doesn't work). I modified the stop word list (I removed the word "c") and put "C", "C++" and "C#" in the stem exclusion list. I also modified the constructor so that it sets the stem exclusion list.

In the class that represents a file to index, I have:

// I've verified my custom constructor was called
@Analyzer(impl = CustomFrenchAnalyzer.class)
...

I printed the query that is built with my custom analyzer and sent to Lucene: for the keywords C, C++ and C#, the query is SContent:c (and not c, c++ or c# as I would like).

Does anybody know why?

Here is my CustomFrenchAnalyzer class:

public class CustomFrenchAnalyzer extends Analyzer {

protected static final Log LOG = LogFactory.getLog(CustomFrenchAnalyzer.class);
/**
 * Extended list of custom French stopwords ( Without "c" ).
 */
public final static String[] FRENCH_STOP_WORDS = { "a", "afin", "ai", "ainsi", "après", "attendu", "au", "aujourd", "auquel", "aussi", "autre", "autres", "aux", "auxquelles", "auxquels", "avait",
        "avant", "avec", "avoir", "car", "ce", "ceci", "cela", "celle", "celles", "celui", "cependant", "certain", "certaine", "certaines", "certains", "ces", "cet", "cette", "ceux", "chez",
        "ci", "combien", "comme", "comment", "concernant", "contre", "d", "dans", "de", "debout", "dedans", "dehors", "delà", "depuis", "derrière", "des", "désormais", "desquelles", "desquels",
        "dessous", "dessus", "devant", "devers", "devra", "divers", "diverse", "diverses", "doit", "donc", "dont", "du", "duquel", "durant", "dès", "elle", "elles", "en", "entre", "environ",
        "est", "et", "etc", "etre", "eu", "eux", "excepté", "hormis", "hors", "hélas", "hui", "il", "ils", "j", "je", "jusqu", "jusque", "l", "la", "laquelle", "le", "lequel", "les",
        "lesquelles", "lesquels", "leur", "leurs", "lorsque", "lui", "là", "ma", "mais", "malgré", "me", "merci", "mes", "mien", "mienne", "miennes", "miens", "moi", "moins", "mon", "moyennant",
        "même", "mêmes", "n", "ne", "ni", "non", "nos", "notre", "nous", "néanmoins", "nôtre", "nôtres", "on", "ont", "ou", "outre", "où", "par", "parmi", "partant", "pas", "passé", "pendant",
        "plein", "plus", "plusieurs", "pour", "pourquoi", "proche", "près", "puisque", "qu", "quand", "que", "quel", "quelle", "quelles", "quels", "qui", "quoi", "quoique", "revoici", "revoilà",
        "s", "sa", "sans", "sauf", "se", "selon", "seront", "ses", "si", "sien", "sienne", "siennes", "siens", "sinon", "soi", "soit", "son", "sont", "sous", "suivant", "sur", "ta", "te", "tes",
        "tien", "tienne", "tiennes", "tiens", "toi", "ton", "tous", "tout", "toute", "toutes", "tu", "un", "une", "va", "vers", "voici", "voilà", "vos", "votre", "vous", "vu", "vôtre", "vôtres",
        "y", "à", "ça", "ès", "été", "être", "ô" };

/**
 * Contains the stopwords used with the StopFilter.
 */
private Set stoptable = new HashSet();
/**
 * Contains words that should be indexed but not stemmed.
 */
private Set excltable = new HashSet<String>(Arrays.asList("C", "C++", "C#"));
private String[] exclListe = { "C", "C++", "C#" };

/**
 * Builds an analyzer with the default stop words ({@link #FRENCH_STOP_WORDS}).
 */
public CustomFrenchAnalyzer() {
    setStemExclusionTable(exclListe);
    stoptable = StopFilter.makeStopSet(FRENCH_STOP_WORDS);
}

/**
 * Builds an analyzer with the given stop words.
 */
public CustomFrenchAnalyzer(String[] stopwords) {
    stoptable = StopFilter.makeStopSet(stopwords);
}

/**
 * Builds an analyzer with the stop words read from the given file.
 * 
 * @throws IOException
 */
public CustomFrenchAnalyzer(File stopwords) throws IOException {
    stoptable = new HashSet(WordlistLoader.getWordSet(stopwords));
}

/**
 * Builds an exclusionlist from an array of Strings.
 */
public void setStemExclusionTable(String[] exclusionlist) {
    excltable = StopFilter.makeStopSet(exclusionlist);
}

/**
 * Builds an exclusionlist from the words contained in the given file.
 * 
 * @throws IOException
 */
/*
 * public void setStemExclusionTable(File exclusionlist) throws IOException { excltable = new HashSet(WordlistLoader.getWordSet(exclusionlist)); }
 */

/**
 * Creates a TokenStream which tokenizes all the text in the provided Reader.
 * 
 * @return A TokenStream build from a StandardTokenizer filtered with StandardFilter, StopFilter, FrenchStemFilter and LowerCaseFilter
 */
public final TokenStream tokenStream(String fieldName, Reader reader) {

    if (fieldName == null)
        throw new IllegalArgumentException("fieldName must not be null");
    if (reader == null)
        throw new IllegalArgumentException("reader must not be null");

    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new StopFilter(result, stoptable);
    result = new FrenchStemFilter(result, excltable);
    // Convert to lowercase after stemming!
    result = new LowerCaseFilter(result);
    return result;
}
}
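
A possible explanation, offered as an assumption rather than a confirmed diagnosis: StandardTokenizer discards punctuation such as "+" and "#" while splitting the text, so "C++" and "C#" are already reduced to "C" before the StopFilter and the stem exclusion list ever see them, and the LowerCaseFilter then yields "c". One workaround that stays outside Lucene is to rewrite the symbol-bearing names into letter-only aliases before the text reaches the analyzer. The sketch below uses only the JDK; the class LanguageNameAliases, the method rewrite and the alias tokens cplusplus and csharp are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class LanguageNameAliases {

    // Hypothetical alias table: names containing symbols mapped to letter-only tokens.
    private static final Map<String, String> ALIASES = new LinkedHashMap<String, String>();
    static {
        ALIASES.put("C++", "cplusplus");
        ALIASES.put("C#", "csharp");
    }

    // Replaces each known name with its alias, ignoring case.
    // The lookarounds reject matches glued to a letter or digit (e.g. "ABC++").
    static String rewrite(String text) {
        String result = text;
        for (Map.Entry<String, String> alias : ALIASES.entrySet()) {
            Pattern pattern = Pattern.compile(
                    "(?i)(?<![\\p{L}\\p{N}])" + Pattern.quote(alias.getKey()) + "(?![\\p{L}\\p{N}])");
            result = pattern.matcher(result).replaceAll(alias.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Je connais C, C++ et C#."));
    }
}
```

The same rewrite would have to be applied to the document content before indexing and to the user's keywords before building the query, otherwise the aliases never match. A plain "C" is left untouched, so it still depends on "c" being absent from the stop word list.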

Thanks

Upvotes: 0

femtoRgon

Reputation: 33351

See FrenchAnalyzer.FRENCH_STOP_WORDS: "c" is a French stop word. You can define your own stop set through the appropriate FrenchAnalyzer constructor.

When defining your own, you can start from the default set and just remove the undesirable stop words. The full default French stop set is:

"a", "afin", "ai", "ainsi", "après", "attendu", "au", "aujourd", "auquel", "aussi",
"autre", "autres", "aux", "auxquelles", "auxquels", "avait", "avant", "avec", "avoir",
"c", "car", "ce", "ceci", "cela", "celle", "celles", "celui", "cependant", "certain",
"certaine", "certaines", "certains", "ces", "cet", "cette", "ceux", "chez", "ci",
"combien", "comme", "comment", "concernant", "contre", "d", "dans", "de", "debout",
"dedans", "dehors", "delà", "depuis", "derrière", "des", "désormais", "desquelles",
"desquels", "dessous", "dessus", "devant", "devers", "devra", "divers", "diverse",
"diverses", "doit", "donc", "dont", "du", "duquel", "durant", "dès", "elle", "elles",
"en", "entre", "environ", "est", "et", "etc", "etre", "eu", "eux", "excepté", "hormis",
"hors", "hélas", "hui", "il", "ils", "j", "je", "jusqu", "jusque", "l", "la", "laquelle",
"le", "lequel", "les", "lesquelles", "lesquels", "leur", "leurs", "lorsque", "lui", "là",
"ma", "mais", "malgré", "me", "merci", "mes", "mien", "mienne", "miennes", "miens", "moi",
"moins", "mon", "moyennant", "même", "mêmes", "n", "ne", "ni", "non", "nos", "notre",
"nous", "néanmoins", "nôtre", "nôtres", "on", "ont", "ou", "outre", "où", "par", "parmi",
"partant", "pas", "passé", "pendant", "plein", "plus", "plusieurs", "pour", "pourquoi",
"proche", "près", "puisque", "qu", "quand", "que", "quel", "quelle", "quelles", "quels",
"qui", "quoi", "quoique", "revoici", "revoilà", "s", "sa", "sans", "sauf", "se", "selon",
"seront", "ses", "si", "sien", "sienne", "siennes", "siens", "sinon", "soi", "soit",
"son", "sont", "sous", "suivant", "sur", "ta", "te", "tes", "tien", "tienne", "tiennes",
"tiens", "toi", "ton", "tous", "tout", "toute", "toutes", "tu", "un", "une", "va", "vers",
"voici", "voilà", "vos", "votre", "vous", "vu", "vôtre", "vôtres", "y", "à", "ça", "ès",
"été", "être", "ô"

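Removing a word from the default set can be sketched with plain java.util collections, as below. This is only an illustration under stated assumptions: the class CustomStopWords, the method without and the shortened DEFAULT_STOP_WORDS array are hypothetical stand-ins. In a real project you would pass FrenchAnalyzer.FRENCH_STOP_WORDS instead and hand the resulting array to the FrenchAnalyzer constructor that accepts stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CustomStopWords {

    // Shortened stand-in for FrenchAnalyzer.FRENCH_STOP_WORDS (assumption, illustration only).
    static final String[] DEFAULT_STOP_WORDS = { "a", "au", "c", "car", "ce", "de", "le" };

    // Returns a copy of the defaults with the unwanted words removed; the input array is not modified.
    static String[] without(String[] defaults, String... unwanted) {
        List<String> kept = new ArrayList<String>(Arrays.asList(defaults));
        kept.removeAll(Arrays.asList(unwanted));
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] stopWords = without(DEFAULT_STOP_WORDS, "c");
        System.out.println(Arrays.toString(stopWords));
    }
}
```

Copying the defaults and filtering them keeps the custom list in sync with the original one, instead of maintaining a hand-edited duplicate of the whole array.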
Upvotes: 0
