planet_pluto

Reputation: 782

Extract top N words that are most similar to an input word from a text file

I have a text file containing the content of a web page that I extracted using BeautifulSoup. Given an input word, I need to find the N words in the text file that are most similar to it. The process is as follows:

  1. The website from which text was extracted: https://en.wikipedia.org/wiki/Football
  2. The extracted text is saved to a text file.
  3. The user inputs a word, e.g. "goal", and I have to display the top N most similar words from the text file.

I have only worked in Computer Vision and am completely new to NLP. I'm currently stuck at step 3. I have tried spaCy and Gensim, but my approach is not at all efficient. I currently do this:

for word in ['goal', 'soccer']:
    # 1. compute similarity using spacy for each word in the text file with the given word.
    # 2. sort them based on the scores and choose the top N-words.

Is there any other approach or a simple solution to solve this problem? Any help is appreciated. Thanks!

Upvotes: 3

Views: 2553

Answers (1)

Sergey Bushmanov

Reputation: 25249

You can use spaCy's similarity method, which calculates the cosine similarity between tokens for you. To use word vectors, load a model that ships with them:

import spacy
nlp = spacy.load("en_core_web_md")

text = "I have a text file that contains the content of a web page that I have extracted using BeautifulSoup. I need to find N similar words from the text file based on a given word. The process is as follows"
doc = nlp(text)
words = ['goal', 'soccer']

# compute similarity of every token in the text to each query word
similarities = {}
for word in words:
    tok = nlp(word)  # a one-word Doc; Doc.similarity also accepts a Token
    similarities[tok.text] = {}
    for tok_ in doc:
        similarities[tok.text][tok_.text] = tok.similarity(tok_)

# sort by score, highest first, and keep the top 10
def top10(word):
    ranked = sorted(similarities[word].items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:10])

# desired output
top10("goal")
{'need': 0.41729581641359625,
 'that': 0.4156277030017712,
 'to': 0.40102258054859163,
 'is': 0.3742535591719576,
 'the': 0.3735002888862756,
 'The': 0.3735002888862756,
 'given': 0.3595024941701789,
 'process': 0.35218102758578645,
 'have': 0.34597281472837316,
 'as': 0.34433650293640194}
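
The output above is dominated by function words ("that", "to", "the") and lists "the" and "The" separately. One possible way to tidy this up is to skip stop words and merge case variants before ranking. The following is only a minimal sketch in plain Python: the function names, the tiny 3-dimensional vectors and the stop word set are all made up for illustration. With spaCy you would take the vectors from token.vector and could use token.is_stop for filtering.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(query, vectors, n=10, stop_words=frozenset()):
    """Return the n words closest to `query` as (word, score) pairs, best first."""
    query_vec = vectors[query]
    scores = {}
    for word, vec in vectors.items():
        key = word.lower()
        # Skip the query word itself and anything in the stop word set.
        if key == query.lower() or key in stop_words:
            continue
        # Merge case variants such as "score"/"Score", keeping the best score.
        scores[key] = max(scores.get(key, -1.0), cosine(query_vec, vec))
    # nlargest avoids sorting the full vocabulary when only n items are needed.
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Hypothetical vectors, invented for this example only.
toy_vectors = {
    "goal": [0.9, 0.1, 0.0],
    "score": [0.8, 0.2, 0.1],
    "Score": [0.8, 0.2, 0.1],
    "pitch": [0.5, 0.5, 0.2],
    "the": [0.1, 0.1, 0.9],
    "that": [0.7, 0.3, 0.0],
}

result = top_n_similar("goal", toy_vectors, n=2, stop_words={"the", "that"})
print(result)
```

With these toy vectors, "score" ranks ahead of "pitch", and "that" is excluded even though its vector is close to the one for "goal".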

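Note: if you're comfortable with gensim and/or already have a word2vec model trained on your text, you can query it directly. In gensim 4.0 and later, most_similar is called on the model's KeyedVectors (model.wv, where model is your trained Word2Vec instance):

model.wv.most_similar(positive=['goal'], topn=10)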

Upvotes: 4
