dapo

Reputation: 717

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to group the words in them by meaning, so that synonyms and words that can be used in place of one another end up together.

For example:

Top sales!

Best sales

I want them to be in the same group.

I built the following function with NLTK's WordNet, but it doesn't work well.

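from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

def synonyms(w, group, guide):
    """Return True if the first senses of w and group are similar."""
    try:
        # Look up the first sense of each word for the part of speech in guide (e.g. 'n')
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
    except WordNetError:
        # One of the words has no such synset
        return False

    # wup_similarity can return None when the synsets are not connected
    score = w1.wup_similarity(w2)
    return score is not None and score >= 0.7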

Any ideas or tools to accomplish this?

Upvotes: 8

Views: 5146

Answers (3)

aerin

Reputation: 22724

The computation behind what Nick described is to calculate the cosine distance between the vectors of the two phrases:

Top sales!
Best sales

Here is one way to do so: How to calculate phrase similarity between phrases
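A minimal sketch of that computation, assuming each phrase vector is simply the average of its word vectors (one common choice, not the only one). The three-dimensional vectors and the names TOY_VECTORS, phrase_vector and cosine_similarity are made up for illustration; real vectors would come from a trained model such as Word2Vec.

```python
import math

# Hypothetical word vectors, invented for this example only.
TOY_VECTORS = {
    "top": [0.9, 0.1, 0.0],
    "best": [0.8, 0.2, 0.1],
    "sales": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.1, 0.9],
}

def phrase_vector(phrase, vectors=TOY_VECTORS):
    """Average the vectors of the known words in a phrase."""
    words = [w.strip("!?.,").lower() for w in phrase.split()]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError(f"no known words in {phrase!r}")
    # Component-wise mean across the word vectors
    return [sum(component) / len(known) for component in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(phrase_vector("Top sales!"), phrase_vector("Best sales"))
sim_far = cosine_similarity(phrase_vector("Top sales!"), phrase_vector("weather"))
print(sim_close, sim_far)
```

Cosine distance is 1 minus this similarity, so the two subject lines from the question come out close together while an unrelated phrase does not.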

Upvotes: 1

Nick Hough

Reputation: 56

The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).

Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.

One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:

  • The river bank was dry.
  • The bank loaned money to me.
  • The plane may bank to the left.

Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.

One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
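To make the difference concrete, here is a small sketch with made-up vectors. The "word|POS" key format follows the convention sense2vec uses; the vectors and the helper names word_lookup and sense_lookup are hypothetical.

```python
# Hypothetical vectors, invented for illustration only.
WORD_VECTORS = {"bank": [0.5, 0.5]}
SENSE_VECTORS = {
    "bank|NOUN": [0.9, 0.1],
    "bank|VERB": [0.1, 0.9],
}

def word_lookup(token):
    """Plain word-level lookup: context and part of speech are ignored."""
    return WORD_VECTORS[token.lower()]

def sense_lookup(token, pos):
    """Lookup keyed by the token plus its part-of-speech tag."""
    return SENSE_VECTORS[f"{token.lower()}|{pos}"]

# The word-level table returns one vector no matter how "bank" is used...
same = word_lookup("bank") == word_lookup("Bank")
# ...while the sense-keyed table separates the noun from the verb.
different = sense_lookup("bank", "NOUN") != sense_lookup("bank", "VERB")
print(same, different)
```

Note that keys based on part of speech alone separate the verb "bank" from the nouns, but the river bank and the money bank are both nouns and would still share a key.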

A great library for this in Python is spaCy. It is like NLTK, but much faster as it is written in Cython (20x faster for tokenization and 400x faster for tagging). Its larger models ship with word vectors, so you can accomplish your similarity task without needing other libraries; sense2vec is available as a separate package from the same developers.

It's as simple as:

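import spacy

# Load an English model that includes word vectors
# (install it first with: python -m spacy download en_core_web_md)
nlp = spacy.load('en_core_web_md')

apples, and_, oranges = nlp('apples and oranges')
print(apples.similarity(oranges))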

It's free and has a liberal license!

Upvotes: 4

Luis Leal

Reputation: 3534

One idea is to solve this with word embeddings such as word2vec. The outcome is a mapping from words to vectors that are "near" each other when the words have similar meanings: for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are near enough to mean the same thing. As I said, this is just the general idea behind word2vec.
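A rough sketch of the thresholding step, assuming cosine similarity as the measure and a simple greedy grouping in which each word joins the first group whose first member is similar enough. The two-dimensional vectors, the 0.95 threshold and the function names are made up for illustration; with real embeddings the threshold would need tuning.

```python
import math

# Hypothetical word vectors, invented for this example only.
TOY_VECTORS = {
    "car": [0.9, 0.1],
    "vehicle": [0.8, 0.2],
    "food": [0.1, 0.9],
    "meal": [0.2, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_words(words, vectors, threshold=0.95):
    """Greedily place each word in the first group it is similar enough to."""
    groups = []  # each group is a list; its first word acts as the representative
    for word in words:
        for group in groups:
            if cosine_similarity(vectors[word], vectors[group[0]]) >= threshold:
                group.append(word)
                break
        else:
            # No existing group was close enough, so start a new one
            groups.append([word])
    return groups

groups = group_words(["car", "vehicle", "food", "meal"], TOY_VECTORS)
print(groups)
```

Greedy grouping depends on the order of the words; a clustering algorithm over the vectors would be a more robust alternative for 50,000 subject lines.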

Upvotes: 1
