Reputation: 6488
I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math

def tf(word, blob):
    # Term frequency: share of this document's words that are `word`.
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Number of documents that contain the word.
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # Inverse document frequency, smoothed with +1 to avoid division by zero.
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

# Build one TextBlob per row of the text column.
bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But the original version above takes a lot of time, as there are 59,000 documents. Is there a better way to do it?
Upvotes: 0
Views: 1084
Reputation: 335
I was also unsure about this subject, but I found a few solutions on the internet that use Spark. You can look here:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried these methods myself and the results were not bad. You may want to try them.
With them I got the following results:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
The cosine similarity results were also better.
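Roughly, the idea is to drop every word whose corpus-wide count falls outside the lower/upper bounds around the mean. A minimal sketch of that kind of frequency-based pruning (the prune_vocabulary helper and the exact bound rule are assumptions for illustration; the default bounds just echo the figures above):

from collections import Counter

def prune_vocabulary(docs, lower=9, upper=95):
    # docs: a list of token lists. The lower/upper defaults mirror the
    # figures quoted above; they are illustrative, not a fixed rule.
    freq = Counter(word for doc in docs for word in doc)
    vocab = {w for w, c in freq.items() if lower <= c <= upper}
    return [[w for w in doc if w in vocab] for doc in docs], vocab

# Hypothetical usage with the bloblist from the question:
# pruned_docs, vocab = prune_vocabulary([list(blob.words) for blob in bloblist])
# print(len(vocab))   # word vector length after normalization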
Upvotes: 1