NightWolf

Reputation: 7794

Word Net - Word Synonyms & related word constructs - Java or Python

I am looking to use WordNet to look for a collection of like terms from a base set of terms.

For example, the word 'discouraged' - potential synonyms could be: daunted, glum, deterred, pessimistic.

I also wanted to identify potential bi-grams such as; beat down, put off, caved in etc.

How do I go about extracting this information using Java or Python? Are there any hosted WordNet databases/web interfaces which would allow such querying?

Thanks!

Upvotes: 6

Views: 2632

Answers (3)

user502187

Reputation:

It is easiest to understand the WordNet data by looking at the Prolog files. They are documented here:

http://wordnet.princeton.edu/wordnet/man/prologdb.5WN.html

WordNet terms are grouped into synsets. A synset is a maximal set of synonyms. Each synset has a primary key so that it can be referenced in semantic relationships.

So answering your first question, you can list the different senses and corresponding synonyms of a word as follows:

Input X: Term
Output Y: Sense  
Output L: Synonyms in this Sense  

s_helper(X,Y) :- s(X,_,Y,_,_,_).  
?- setof(H,(s_helper(Y,X),s_helper(Y,H)),L).  

Example:

?- setof(H,(s_helper(Y,'discouraged'),s_helper(Y,H)),L).  
Y = 301664880,  
L = [demoralised, demoralized, discouraged, disheartened] ;  
Y = 301992418,  
L = [discouraged] ;  
No  

For the second part of your question: WordNet terms are sequences of words, so you can search these terms for a given word as follows:

Input X: Word  
Output Y: Term

s_helper(X) :- s(_,_,X,_,_,_).  
word_in_term(X,Y) :- atom_concat(X,' ',H), sub_atom(Y,0,_,_,H).
word_in_term(X,Y) :- atom_concat(' ',X,H), atom_concat(H,' ',J), sub_atom(Y,_,_,_,J).
word_in_term(X,Y) :- atom_concat(' ',X,H), sub_atom(Y,_,_,0,H).
?- s_helper(Y), word_in_term(X,Y).

Example:

?- s_helper(X), word_in_term('beat',X).  
X = 'beat generation' ;  
X = 'beat in' ;  
X = 'beat about' ;  
X = 'beat around the bush' ;  
X = 'beat out' ;  
X = 'beat up' ;  
X = 'beat up' ;  
X = 'beat back' ;  
X = 'beat out' ;  
X = 'beat down' ;  
X = 'beat a retreat' ;  
X = 'beat down' ;  
X = 'beat down' ;  
No

This gives you potential n-grams, but not so much morphological variation. WordNet also provides some lexical relations, which could be useful.
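The same word-in-term check can be sketched in Python. The term list below is a hand-made toy sample, not the real WordNet data, and a whole-word split() test replaces the three space-boundary clauses of word_in_term/2:

```python
# Toy sample of multi-word WordNet-style terms (not real WordNet data).
terms = [
    "beat generation", "beat down", "beat around the bush",
    "put off", "cave in", "heartbeat",
]

def word_in_term(word, term):
    """True if `word` occurs as a whole word inside the multi-word `term`."""
    return word in term.split()

matches = [t for t in terms if word_in_term("beat", t)]
print(matches)  # terms containing 'beat' as a separate word; 'heartbeat' is excluded
```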

Neither of the Prolog queries above is very efficient; the problem is the lack of word indexing. A Java implementation could of course do better. Imagine something along the lines of:

class Synset {  
    static Hashtable<Integer,Synset> synset_access;  
    static Hashtable<String,Vector<Synset>> term_access;  
}

Some Prolog systems can do the same: an indexing directive instructs the system to index a predicate on multiple arguments.
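For illustration, the double index sketched in the Java class above looks like this in Python, using a tiny hand-made synset sample:

```python
# One map from synset id to member terms, one from term back to synset
# ids, giving constant-time lookup in both directions. Toy data only.
from collections import defaultdict

synsets = {
    301664880: ["demoralised", "demoralized", "discouraged", "disheartened"],
    301992418: ["discouraged"],
}

term_index = defaultdict(list)
for synset_id, members in synsets.items():
    for term in members:
        term_index[term].append(synset_id)

# All senses of a word, then the synonyms in each sense:
for sid in term_index["discouraged"]:
    print(sid, synsets[sid])
```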

Putting up a web service shouldn't be that difficult, either in Java or Prolog. Many Prolog systems allow embedding Prolog programs in web servers, and Java has servlets.

A list of Prologs that support web servers can be found here:

http://en.wikipedia.org/wiki/Comparison_of_Prolog_implementations#Operating_system_and_Web-related_features

Best Regards

Upvotes: 3

Lev Khomich

Reputation: 2247

As an alternative to NLTK, you can use one of the available WordNet SPARQL endpoints to retrieve such information. Example query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>

SELECT DISTINCT ?label {
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = 'run')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
} LIMIT 100

In the Java universe, the Jena and Sesame frameworks can be used.
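From Python, a small sketch could parameterize the query above by input word; the submission step is shown only in comments, and the endpoint URL there is a placeholder, since public WordNet SPARQL endpoints come and go:

```python
# Build the synonym query above for an arbitrary input word.
def build_synonym_query(word, limit=100):
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wordnet: <http://www.w3.org/2006/03/wn/wn20/schema/>
SELECT DISTINCT ?label {{
  ?input_word a wordnet:WordSense;
     rdfs:label ?input_label.
  FILTER (?input_label = '{word}')
  ?synset wordnet:containsWordSense ?input_word.
  ?synset wordnet:containsWordSense ?synonym.
  ?synonym rdfs:label ?label.
}} LIMIT {limit}"""

query = build_synonym_query("run")

# With the SPARQLWrapper package it could then be submitted like:
#   from SPARQLWrapper import SPARQLWrapper, JSON
#   sparql = SPARQLWrapper("http://example.org/wordnet/sparql")  # placeholder URL
#   sparql.setQuery(query)
#   sparql.setReturnFormat(JSON)
#   results = sparql.query().convert()
```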

Upvotes: 2

nflacco

Reputation: 5082

These are two different problems.

1) WordNet and Python. Use NLTK; it has a nice interface to WordNet. You could write something on your own, but honestly, why make life difficult? LingPipe probably also has something built in, but NLTK is much easier to use. I believe NLTK just downloads a local WordNet database, but I'm pretty sure there are also APIs to talk to WordNet remotely.

2) To get bigrams in NLTK, follow this tutorial. In general, you tokenize the text and then iterate over the sentence, collecting the n-grams for each word by looking forward and backward.
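The tokenize-then-slide idea can be sketched with plain Python; nltk.util.bigrams produces the same pairs:

```python
# Naive whitespace tokenization; real pipelines would use a proper tokenizer.
tokens = "he was beat down and put off by the news".split()

# Pair each token with its successor to get the bigrams.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams[:3])  # [('he', 'was'), ('was', 'beat'), ('beat', 'down')]
```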

Upvotes: 3
