Reputation: 391
I need to find the semantic similarity / relatedness between two input words. The following words are similar or related in the real world:
- genuineness, genuine, genuinely, valid, reality, fact, really
- painter, painting, paint
The following is my code snippet, which I took from here:
ILexicalDatabase db = new NictWordNet();
RelatednessCalculator lin = new Lin(db);
RelatednessCalculator wup = new WuPalmer(db);
RelatednessCalculator path = new Path(db);
String w1 = "truth";
String w2 = "genuine";
System.out.println(lin.calcRelatednessOfWords(w1, w2));
System.out.println(wup.calcRelatednessOfWords(w1, w2));
System.out.println(path.calcRelatednessOfWords(w1, w2));
I am using the WS4J API (ws4j1.0.1.jar) with Java 1.7 in Eclipse 3.4. I am getting the following results, which make no sense to me, or maybe my perception is wrong.
If my approach is wrong, please let me know which other API I should be using to work out the similarity between words.
Upvotes: 1
Views: 2108
Reputation: 3818
It looks like the words are not found in the dataset you have configured, so it simply returns a score of 0.0. For example, the following nonsense words result in a score of 0.0 as well:
ILexicalDatabase db = new NictWordNet();
RelatednessCalculator lin = new Lin(db);
RelatednessCalculator wup = new WuPalmer(db);
RelatednessCalculator path = new Path(db);
String w1 = "iamatotallycompletelyfakewordwithagermanwordinsidevergnügen";
String w2 = "iamevenmorefakeandstrangerossiskajafoderatsija";
System.out.println(lin.calcRelatednessOfWords(w1, w2));
System.out.println(wup.calcRelatednessOfWords(w1, w2));
System.out.println(path.calcRelatednessOfWords(w1, w2));
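To make the "not found, so 0.0" behaviour more concrete, here is a minimal, self-contained sketch of a Wu-Palmer-style score over a tiny hand-made is-a hierarchy. It is not WS4J's code or data: the class name ToyWuPalmer, the hierarchy, and the choice to return 0.0 for unknown words are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: this is NOT WS4J's implementation or its WordNet data.
class ToyWuPalmer {
    // Hypothetical hand-made is-a hierarchy: child -> parent ("entity" is the root).
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("entity", null);
        PARENT.put("person", "entity");
        PARENT.put("artist", "person");
        PARENT.put("painter", "artist");
        PARENT.put("sculptor", "artist");
        PARENT.put("artifact", "entity");
        PARENT.put("painting", "artifact");
    }

    // Nodes from the word up to the root; empty if the word is not in the hierarchy.
    // The size of this list is the depth of the word (the root has depth 1).
    static List<String> pathToRoot(String word) {
        List<String> path = new ArrayList<>();
        if (!PARENT.containsKey(word)) {
            return path;
        }
        for (String node = word; node != null; node = PARENT.get(node)) {
            path.add(node);
        }
        return path;
    }

    // Wu-Palmer style score: 2 * depth(lcs) / (depth(w1) + depth(w2)).
    // Assumption: an unknown word yields 0.0, mirroring the behaviour described above.
    static double similarity(String w1, String w2) {
        List<String> p1 = pathToRoot(w1);
        List<String> p2 = pathToRoot(w2);
        if (p1.isEmpty() || p2.isEmpty()) {
            return 0.0;
        }
        // The first ancestor of w1 that is also an ancestor of w2
        // is the least common subsumer (lcs).
        for (String node : p1) {
            if (p2.contains(node)) {
                int lcsDepth = pathToRoot(node).size();
                return 2.0 * lcsDepth / (p1.size() + p2.size());
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        System.out.println(similarity("painter", "sculptor"));  // prints 0.75
        System.out.println(similarity("painter", "painting"));  // 2/7, roughly 0.29
        System.out.println(similarity("painter", "iamatotallyfakeword"));  // prints 0.0
    }
}
```

In this toy version, a score of 0.0 only tells you that a lookup failed, not that the words are unrelated, which is why checking whether each word is actually present in the configured data is a sensible first step.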
Unfortunately, I can't tell what your configuration is like, and the link you supplied does not seem to work (any more, at least). However, the JAR for ws4j 1.0.1 at Google Code includes its own information content file (named ic-semcor.dat) which is configured in the file similarity.conf:
# ----------------------------------------------------------------------
# The following option is supported by :
# res, lin, jcn
infocontent = ic-semcor.dat
# Specifies the name of an information content file under
# data/. The value of this option must be the name of a
# file, or a relative or absolute path name. The default
# value of this option ic-semcor.dat.
Using this setup, I get the same results for the words you listed in your table. Therefore, you should look more closely into the configuration of the individual RelatednessCalculator implementations for the different metrics.
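As a starting point for checking the configuration, the sketch below reads the infocontent option out of text shaped like the similarity.conf excerpt above. It assumes the file consists of "key = value" lines with "#" comments, which matches the excerpt, and it parses them with java.util.Properties. That is my own choice for the illustration and not necessarily how WS4J reads the file; the class name ConfPeek and the fallback value are likewise hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustration only: WS4J may load similarity.conf in a different way.
class ConfPeek {
    // Returns the value of the "infocontent" option, or the default
    // named in the excerpt (ic-semcor.dat) if the option is absent.
    static String infoContentFile(String confText) {
        Properties props = new Properties();
        try {
            // Properties treats lines starting with '#' as comments.
            props.load(new StringReader(confText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new IllegalStateException(e);
        }
        return props.getProperty("infocontent", "ic-semcor.dat");
    }

    public static void main(String[] args) {
        String conf =
              "# The following option is supported by :\n"
            + "# res, lin, jcn\n"
            + "infocontent = ic-semcor.dat\n";
        System.out.println(infoContentFile(conf));  // prints ic-semcor.dat
    }
}
```

Note that, according to the comment in the excerpt, this option only applies to res, lin and jcn, so of the three calculators in the question only Lin would be affected by it.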
Upvotes: 1