redrubia

Reputation: 2366

Extracting more similar words from a list of words

So I have a list of words describing a particular group. For example, one group is based around pets.

The words for the example group, pets, are as follows:

[pets, pet, kitten, cat, cats, kitten, puppies, puppy, dog, dogs, dog walking, begging, catnip, lol, catshit, thug life, poop, lead, leads, bones, garden, mouse, bird, hamster, hamsters, rabbits, rabbit, german shepherd, moggie, mongrel, tomcat, lolcatz, bitch, icanhazcheeseburger, bichon frise, toy dog, poodle, terrier, russell, collie, lab, labrador, persian, siamese, rescue, Celia Hammond, RSPCA, battersea dogs home, rescue home, battersea cats home, animal rescue, vets, vet, supervet, Steve Irwin, pugs, collar, worming, fleas, ginger, maine coon, smelly cat, cat people, dog person, Calvin and Hobbes, Calvin & Hobbes, cat litter, catflap, cat flap, scratching post, chew toy, squeaky toy, pets at home, cruft's, crufts, corgi, best in show, animals, Manchester dogs' home, manchester dogs home, cocker spaniel, labradoodle, spaniel, sheepdog, Himalayan, chinchilla, tabby, bobcat, ragdoll, short hair, long hair, tabby cat, calico, tabbies, looking for a good home, neutring, missing, spayed, neutered, declawing, deworming, declawed, pet insurance, pet plan, guinea pig, guinea pigs, ferret, hedgehogs, minipigs, mastiff, leonburger, great dane, four-legged friend, walkies, goldfish, terrapin, whiskas, mr dog, sheba, iams]

Now I plan on enriching this list using NLTK.

As a start, I can get the synsets of each word. If we take cats as an example, we obtain:

Synset('cat.n.01')
Synset('guy.n.01')
Synset('cat.n.03')
Synset('kat.n.01')
Synset('cat-o'-nine-tails.n.01')
Synset('caterpillar.n.02')
Synset('big_cat.n.01')
Synset('computerized_tomography.n.01')
Synset('cat.v.01')
Synset('vomit.v.01')

For this we use NLTK's WordNet interface: from nltk.corpus import wordnet as wn.

We can then obtain the lemmas for each synset. By simply adding these lemmas I in turn add quite a bit of noise; however, I also add some interesting words.
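To make the lemma step concrete, here is a minimal sketch of it. The helper names `wordnet_synsets` and `expand_with_lemmas` are made up for illustration; the only NLTK calls assumed are `wn.synsets()` and `Synset.lemma_names()`, and the WordNet corpus is assumed to be downloaded already.

```python
def wordnet_synsets(word):
    """Look up synsets in WordNet; needs NLTK and its 'wordnet' corpus."""
    from nltk.corpus import wordnet as wn  # imported lazily, on first call
    # WordNet stores multi-word entries with underscores ("german_shepherd").
    return wn.synsets(word.replace(" ", "_"))


def expand_with_lemmas(words, synsets_of=wordnet_synsets):
    """Map each seed word to the lemma names found across all of its synsets."""
    expansion = {}
    for word in words:
        lemmas = set()
        for synset in synsets_of(word):
            for name in synset.lemma_names():
                # Normalise "true_cat" -> "true cat" so it matches the list style.
                lemmas.add(name.replace("_", " ").lower())
        lemmas.discard(word.lower())  # the seed word itself is not new
        expansion[word] = lemmas
    return expansion
```

Calling `expand_with_lemmas(["cats", "puppy"])` returns one set of candidate words per seed word. Because every sense of a word contributes its lemmas, this is exactly where the noise comes from.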

What I would like to look at is noise reduction, and I would appreciate any suggestions or alternative methods to the above.

One idea I am trying is to check whether the word 'cats' appears in the synset name or definition, and to include or exclude that synset's lemmas accordingly.
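A rough sketch of that filter is below. The function names are hypothetical, and matching on whole tokens rather than substrings is my own assumption, chosen so that `caterpillar.n.02` does not match `cat`. It works on any objects exposing `name()`, `definition()` and `lemma_names()`, which NLTK synsets do.

```python
import re


def tokens(text):
    """Lower-case alphabetic tokens, so 'big_cat' yields {'big', 'cat'}."""
    return set(re.findall(r"[a-z]+", text.lower()))


def mentions_keyword(synset, keywords):
    """True if any keyword is a whole token of the synset name or definition."""
    head = synset.name().split(".")[0]  # 'cat.n.01' -> 'cat'
    return bool(keywords & (tokens(head) | tokens(synset.definition())))


def filtered_lemmas(synsets, keywords):
    """Collect lemmas only from synsets that mention one of the keywords."""
    kept = set()
    for synset in synsets:
        if mentions_keyword(synset, keywords):
            kept.update(n.replace("_", " ").lower() for n in synset.lemma_names())
    return kept
```

With NLTK this would be called as `filtered_lemmas(wn.synsets("cats"), {"cat", "cats"})`. One caveat: any sense whose own name is `cat` (such as `cat.n.03` or `cat.v.01`) passes the name check regardless of meaning, so checking only the definition may be the stricter variant.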

Upvotes: 3

Views: 561

Answers (1)

Nikita Astrakhantsev

Reputation: 4749

I'd propose using semantic similarity here with a variant of kNN. For each candidate word, compute its pairwise semantic similarity to all gold-standard words and keep only the k most similar gold-standard words (try different values of k, from 5 to 100). Then compute the average (or sum) of the similarities to these k words and use that value to discard noisy candidates, either by sorting and keeping only the n best or by cutting off at an experimentally determined threshold.
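As a minimal sketch of this scoring scheme (the function names are made up for illustration, and `similarity` stands for whatever word-to-word similarity function you choose to plug in):

```python
def knn_score(candidate, gold_words, similarity, k):
    """Average similarity of `candidate` to its k most similar gold words."""
    sims = sorted(
        (similarity(candidate, gold) for gold in gold_words if gold != candidate),
        reverse=True,
    )
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


def rank_candidates(candidates, gold_words, similarity, k=10):
    """Return (word, score) pairs, best-scoring candidates first."""
    scored = [(c, knn_score(c, gold_words, similarity, k)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def keep_best(ranked, n=None, threshold=None):
    """Keep the n best candidates and/or those scoring at least `threshold`."""
    if threshold is not None:
        ranked = [(w, s) for w, s in ranked if s >= threshold]
    if n is not None:
        ranked = ranked[:n]
    return [w for w, _ in ranked]
```

Here `gold_words` would be the original pets list and `candidates` the lemmas pulled from WordNet. Both k and the threshold need tuning on your data; the defaults above are placeholders, not recommendations.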

Semantic similarity can be computed on the basis of WordNet (see related question) or on the basis of vector models learned by word2vec or similar techniques (see related question again).
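For the WordNet route, one simple option is sketched below: take the best path similarity over all sense pairs of the two words. The helper names are hypothetical, and taking the maximum over senses is my own assumption rather than the only sensible choice. `Synset.path_similarity()` is part of NLTK and may return `None` when two senses are not connected.

```python
def wordnet_synsets(word):
    """Look up synsets in WordNet; needs NLTK and its 'wordnet' corpus."""
    from nltk.corpus import wordnet as wn  # imported lazily, on first call
    # WordNet stores multi-word entries with underscores ("guinea_pig").
    return wn.synsets(word.replace(" ", "_"))


def wordnet_similarity(word_a, word_b, synsets_of=wordnet_synsets):
    """Best path similarity over all sense pairs; 0.0 if nothing is comparable."""
    best = 0.0
    for sense_a in synsets_of(word_a):
        for sense_b in synsets_of(word_b):
            score = sense_a.path_similarity(sense_b)
            if score is not None and score > best:
                best = score
    return best
```

Words that WordNet does not know (for example `lolcatz`) have no synsets and therefore score 0.0 against everything, which is one reason to consider word2vec-style vectors as an alternative.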

Actually, you can try this technique with all words as candidates, or with all or some of the words occurring in domain-specific texts. In the latter case the task is called automatic term recognition, and its methods can be used for your problem directly or as a source of candidates; search for them on Google Scholar. As an example with a short description of existing approaches and links to surveys, see this paper:

Fedorenko, D., Astrakhantsev, N., & Turdakov, D. (2013). Automatic recognition of domain-specific terms: an experimental evaluation. In SYRCoDIS (pp. 15-23).

Upvotes: 2
