Reputation: 310
A common practice in natural language processing is to calculate the similarity between two words using WordNet. I'll start my question with the following Python code:
from nltk.corpus import wordnet
sport = wordnet.synsets("sport")[0]
badminton = wordnet.synsets("badminton")[0]
print(sport.wup_similarity(badminton))
We will get 0.8421
Now what if I look up "haha" and "lol" as follows:
haha = wordnet.synsets("haha")
lol = wordnet.synsets("lol")
print(haha)
print(lol)
We will get
[]
[]
Since neither word has any synsets, we cannot compute the similarity between them. What can we do in this case?
Upvotes: 6
Views: 1735
Reputation: 894
There are different models for measuring similarity, such as word2vec or GloVe, but you seem to be looking more for a corpus that includes social, informal phrases like 'lol'.
However, I'm going to bring up word2vec because it leads to what I think is an answer to your question.
The foundational concept of word2vec (and other word embedding models like GloVe) is the representation of words in a vector space that captures relationships between words. This lends itself very well to measuring similarity, since there is a lot of established vector math to draw from. You can read more about the technical details of word2vec in the original paper, but I quite like this blog post because it is well-written and concise.
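To make the vector-space idea concrete, here is a minimal sketch of cosine similarity, one common way to compare word vectors. The three-dimensional vectors are made up for illustration; a real model would learn vectors with hundreds of dimensions from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up vectors, for illustration only (not taken from any trained model)
toy_vectors = {
    "haha": [0.9, 0.1, 0.0],
    "lol": [0.8, 0.2, 0.1],
    "badminton": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["haha"], toy_vectors["lol"]))
print(cosine_similarity(toy_vectors["haha"], toy_vectors["badminton"]))
```

With these toy values, "haha" and "lol" point in nearly the same direction and score much higher than "haha" and "badminton".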
Again, since word2vec is just a model, you need to pair it with the right training set to get the kind of scope you seek. There are some pre-trained models floating around on the web, such as this bunch. The training set is really what allows you to query a larger variety of terms, rather than the model.
You can certainly use those pre-trained models if they have social phrases like the ones you're seeking. However, if you don't see a model that has been trained on a suitable corpus, you could easily train a model yourself. I suggest Twitter or Wikipedia for corpora (training sets), and the implementation of word2vec in gensim as a word embedding model.
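Training your own model with gensim could look roughly like the sketch below. It assumes gensim 4.x (where the dimension parameter is called vector_size); the tokenizer and the hyperparameter values are illustrative choices, not recommendations, and train_word2vec is a hypothetical helper name.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def train_word2vec(texts):
    """Train a small skip-gram model on an iterable of raw strings."""
    # Imported here so that tokenize() stays usable without gensim installed.
    from gensim.models import Word2Vec  # assumes gensim 4.x

    sentences = [tokenize(t) for t in texts]
    # Illustrative settings: sg=1 selects skip-gram, min_count=1 keeps rare words.
    return Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)
```

Given a list of informal texts (tweets, say), train_word2vec(texts).wv.similarity("haha", "lol") would return a similarity score, provided both words occur in the training data. A corpus of useful size is needed before the scores mean much.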
Upvotes: 2
Reputation: 778
You can use other frameworks. I also tried NLTK at first but finally landed on spaCy (spacy.io), a very fast and functional framework. It has a method for words called 'similarity' that compares a word to other words (it also works for sentences, docs, etc.). It is implemented using word2vec. I don't actually know how big its vocabulary is or how it copes when a word is unknown, but it might be worth a try.
I also played around a little with this one: https://radimrehurek.com/gensim/models/word2vec.html where, in two lines, you can load Google's big word2vec model (this project ports Google's word2vec C++ library to Python), which is available here: https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit
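A sketch of loading a pre-trained binary model with gensim and querying it defensively is shown below. The file path is whatever you saved the download as (an assumption, not something gensim provides), and safe_similarity is a hypothetical helper that returns None for out-of-vocabulary words instead of raising an error.

```python
def load_pretrained(path):
    """Load vectors stored in the binary word2vec format."""
    # Imported here so that safe_similarity() stays usable without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True)


def safe_similarity(vectors, word_a, word_b):
    """Return the similarity, or None if either word is missing from the vocabulary."""
    if word_a not in vectors or word_b not in vectors:
        return None
    return vectors.similarity(word_a, word_b)
```

For example, safe_similarity(load_pretrained(path), "haha", "lol") gives a score if the model knows both words, and None otherwise, which makes it easy to check how well a given model covers informal vocabulary.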
Upvotes: 2
Reputation: 1351
There are two other possible approaches:
CBOW (continuous bag of words): predicts a word from its surrounding context words.
Skip-gram: the reverse of CBOW; it predicts the surrounding context words from a given word.
Take a look at this: https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures-in-laymans-terms
These models are well explained here: https://www.tensorflow.org/tutorials/word2vec. Gensim is also a good Python library for this kind of task.
Try looking at TensorFlow solutions, for example this one: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py
Or read about word2vec: https://en.wikipedia.org/wiki/Word2vec
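The difference between CBOW and skip-gram is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it does not train anything, and the function names and window size are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is used to predict each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the neighbors together predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = ["haha", "that", "was", "funny"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram yields one example per (word, neighbor) pair, while CBOW yields one example per word with all of its neighbors grouped as the input.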
Upvotes: 5
Reputation: 405
You can create a semantic space from co-occurrence matrices using a tool like Dissect (DIStributional SEmantics Composition Toolkit), and then you are set to measure semantic similarity between words or phrases (if you compose words).
In your case, for "haha" and "lol", you'll need to collect those co-occurrences from a corpus that contains them.
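As a rough illustration of the co-occurrence idea (this is plain Python, not Dissect's API), the sketch below counts which words appear near each other within a window and compares the resulting count vectors with cosine similarity. The three-sentence corpus is made up; a real semantic space needs far more text and usually some reweighting of the raw counts.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


# Made-up toy corpus, for illustration only
corpus = [
    ["haha", "that", "was", "funny"],
    ["lol", "that", "was", "funny"],
    ["badminton", "is", "a", "sport"],
]
vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["haha"], vectors["lol"]))
print(cosine(vectors["haha"], vectors["badminton"]))
```

Here "haha" and "lol" come out as similar because they occur in the same contexts, even though neither is in WordNet.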
Another thing to try is word2vec.
Upvotes: 7