Kristada673

Reputation: 3744

Given a list of words, how can I develop an algorithmic way to group them semantically?

I am working with the Google Places API, which defines a list of 97 different location types. I want to reduce this list to a smaller number of categories, since many of the types are groupable: for example, atm and bank into financial; temple, church, mosque, and synagogue into worship; school and university into education; subway_station, train_station, transit_station, and gas_station into transportation.

At the same time, the grouping should not overgeneralize; for example, pet_store, city_hall, courthouse, and restaurant should not all collapse into something as broad as buildings.

I have tried quite a few methods. First, I downloaded synonyms of each of the 97 words from multiple dictionaries. Then I computed the similarity between two words as the fraction of unique synonyms they share (Jaccard similarity):

[Image: matrix of pairwise Jaccard similarity scores between the location types]
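For reference, the similarity computation itself is small; a minimal sketch, with made-up synonym sets standing in for the scraped dictionary data:

```python
from itertools import combinations

# Made-up synonym sets; in practice these come from the dictionaries
# mentioned above, one set per Places type.
synonyms = {
    "atm": {"cash machine", "cashpoint", "cash dispenser"},
    "bank": {"depository", "cash machine", "lender"},
    "school": {"academy", "college", "institute"},
    "university": {"college", "academy", "institution"},
}

def jaccard(a, b):
    """Fraction of unique synonyms shared: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

for w1, w2 in combinations(synonyms, 2):
    print(w1, w2, round(jaccard(synonyms[w1], synonyms[w2]), 3))
```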

But after that, how do I group the words into clusters? Traditional clustering methods (k-means, k-medoids, hierarchical clustering, and FCM) do not give me any good clustering; I identified several misclassifications by scanning the results manually:

[Images: cluster assignments produced by the traditional methods, showing the misclassified words]
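As one concrete setup, hierarchical clustering can be run directly on a precomputed distance matrix (distance = 1 − similarity); below is a minimal sketch using scipy, with an invented similarity matrix and an arbitrary cut threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented similarity matrix for four place types.
words = ["atm", "bank", "school", "university"]
sim = np.array([
    [1.0, 0.6, 0.1, 0.0],
    [0.6, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])

# Hierarchical clustering expects distances, not similarities.
dist = 1.0 - sim  # diagonal becomes exactly 0.0

Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")  # cut threshold is arbitrary
print(dict(zip(words, labels)))  # e.g. {'atm': 1, 'bank': 1, 'school': 2, 'university': 2}
```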

I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), but I do not get good clusters from that either:

[Image: cluster assignments obtained from the word2vec vectors]
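(For reference, that pipeline looks roughly like the sketch below; the file path, word list, and cluster count are placeholders, and multi-word types such as subway_station may be missing from the Google News vocabulary.)

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# The file path is a placeholder; the pretrained Google News vectors
# (~3.4 GB) have to be downloaded separately.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

place_types = ["atm", "bank", "temple", "church", "school", "university"]
present = [w for w in place_types if w in kv]   # skip out-of-vocabulary types
vectors = [kv[w] for w in present]              # 300-dimensional vectors

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in zip(present, km.labels_):
    print(word, label)
```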

Upvotes: 1

Views: 186

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77454

You need much more data.

No algorithm, without additional data, will ever relate atm and bank to financial, because that requires knowledge of these terms.

Jaccard similarity doesn't have access to such knowledge; it can only work on the word strings themselves. And at that level, "river bank" and "bank branch" look very similar.

So don't expect the algorithm to perform magic. The magic needs to be in the data...

Upvotes: 1

stackoverflowuser2010

Reputation: 40889

You are probably looking for something related to vector-space dimensionality reduction. For these techniques, you'll need a corpus of text that uses the location types as words. Dimensionality reduction will then group the terms together. You can read up on Latent Dirichlet Allocation and latent semantic indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of later advances are not captured; as you noted, there has been a lot of work since, such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
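As a minimal illustration of the idea (the toy corpus below is invented; a real corpus would need many documents mentioning the location types in context), latent semantic indexing can be approximated with a truncated SVD of a term-document count matrix:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus; each document mentions some place types in context.
docs = [
    "withdraw cash at the atm or visit the bank branch",
    "the bank approved the loan after the atm deposit",
    "students walk from the school to the university library",
    "the university accepts school transcripts",
]

# Term-document counts, then a low-rank SVD (the core of LSI).
vec = CountVectorizer()
X = vec.fit_transform(docs)                   # shape: (n_docs, n_terms)
svd = TruncatedSVD(n_components=2, random_state=0).fit(X)

# Each vocabulary term gets a dense 2-d representation.
term_vectors = svd.components_.T              # shape: (n_terms, 2)
for term, v in zip(vec.get_feature_names_out(), term_vectors):
    print(term, v.round(2))
```

Those dense term vectors can then be clustered the same way as the word2vec vectors in the question.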

Upvotes: 1
