Reputation: 79
I have a collection of newspaper articles (unlabeled, just raw articles) on a disease. I also have three sets of manually chosen keywords associated with the disease, e.g. phase-1, phase-2, etc., like below.
phase_1 = ["symptoms","signs","fever","ache","vomit","blood","headache","fatigue","breath"]
phase_2 = ["pathogen","flavivirus","swamp","virus","contagious","mosquito bite","virus","agent","host"]
Is there any way to calculate the similarity between a set of keywords and the news articles, using Python?
Upvotes: 1
Views: 2347
Reputation: 45
You can define various similarity metrics for such a task and then test which one works best. Here are some ideas:
1.) As pointed out in the post by Max, you can calculate the Jaccard index between the document and each of the two lists. The Jaccard index is defined as the size of the intersection divided by the size of the union of the two sets:
# Treat the article and the keyword list as sets of unique tokens
set1 = set(news_article.split())
set2 = set(phase_1)
# Jaccard index: |intersection| / |union|
jcc = len(set1.intersection(set2)) / len(set1.union(set2))
The higher the Jaccard index, the more similar the text is to the list. However, the Jaccard index only works if your news articles contain exactly the words that you have defined in your lists. A text whose words are semantically similar but different from the ones in your list would still have a Jaccard index of 0, even though it is related.
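For instance, a minimal sketch (assuming news_article is a plain string and the two phase lists are defined as in the question) that scores one article against every keyword list:
def jaccard(tokens, keywords):
    # |intersection| / |union| of the two token sets
    a, b = set(tokens), set(keywords)
    return len(a & b) / len(a | b)

tokens = news_article.split()
scores = {"phase_1": jaccard(tokens, phase_1),
          "phase_2": jaccard(tokens, phase_2)}
print(scores)  # the list with the highest score is the closest match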
2.) I recommend also trying a slightly more advanced approach based on the Word Mover's Distance (WMD). For this you need a representation of your words in some vector space (e.g. obtained from a word2vec model). Then you can represent a news article and one of your lists as a collection of vectors in that space. The metric measures how different the two representations are (how much you have to move one representation to match the other). The smaller the distance, the more similar the two representations.
You can probably train your word2vec model on your news articles. I'd suggest using gensim both for training the model and for later evaluating the Word Mover's Distance:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_wmd.html
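A minimal sketch of how that could look, assuming gensim >= 4.0 (wmdistance additionally needs an optimal-transport backend, POT in recent versions) and a hypothetical tokenized_articles variable holding one token list per article (see the preprocessing sketch below):
from gensim.models import Word2Vec

# Train word2vec on your own articles; tokenized_articles is assumed
# to be a list of token lists, one per article
model = Word2Vec(sentences=tokenized_articles, vector_size=100, min_count=2)

# WMD between one article and a keyword list; smaller = more similar
distance = model.wv.wmdistance(tokenized_articles[0], phase_1)
print(distance)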
It's not guaranteed to work, but I'd give it a shot. In my experience the WMD usually works better than cosine distance, but it of course depends on the application.
Both approaches also depend on the text preprocessing you do beforehand. Make sure your news articles are in the format you expect before evaluating the metrics or training the word2vec model.
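As a starting point, a minimal preprocessing sketch (assuming plain-text articles in a hypothetical news_articles list; lowercasing, stripping punctuation, whitespace tokenization):
import re

def preprocess(text):
    # Lowercase, keep only letters and whitespace, then split into tokens
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

tokenized_articles = [preprocess(article) for article in news_articles]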
Upvotes: 1
Reputation: 286
When naming your lists, I don't recommend using dashes or other special characters. I hope this helps:
phase_1 = ["symptoms","signs","fever","ache","vomit","blood","headache","fatigue","breath"]
phase_2 = ["pathogen","flavivirus","swamp","virus","contagious","mosquito bite","virus","agent","host"]
# Performing the calculations
res = len(set(phase_1) & set(phase_2))
res2 = res / float(len(set(phase_1) | set(phase_2))) * 100
# Showing the results
msg = "The percentage of smilarity between the 2 lists is:"
print(msg, res2)
Upvotes: 1