Reputation: 3502
I have a very large dataset of domain names; its approximate size is 1 million entries.
I want to find similar domains that are duplicated in the dataset because of misspellings.
So I have been using cosine similarity to find similar documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["example.com", "examplecom", "googl.com", "google.com", ...]
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset)
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)
The above example works fine for a small dataset, but for the large dataset it throws an out-of-memory error.
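(For scale: a dense 1,000,000 x 1,000,000 matrix of 64-bit floats needs roughly 10^6 x 10^6 x 8 bytes, about 8 TB, so the full pairwise similarity matrix can never fit in 8 GB of RAM.)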
System configuration:
1) 8 GB RAM
2) 64-bit system with 64-bit Python installed
3) i3-3210 processor
How can I compute cosine similarity for such a large dataset?
Upvotes: 0
Views: 1544
Reputation: 14399
You can use a KDTree built on L2-normalized inputs to obtain cosine distances, as per the answer here. Then it's just a case of setting a maximum distance you want to return (so you don't keep all the larger distances, which account for most of the memory you are using) and returning a sparse distance matrix using, for example, a coo_matrix from scipy.spatial.cKDTree.sparse_distance_matrix.
Unfortunately I don't have my interpreter handy to code up a full answer right now, but that's the gist of it.
Make sure whatever model you're fitting from that distance matrix can accept sparse inputs, though.
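A minimal sketch of the idea, assuming scikit-learn's TfidfVectorizer from the question and scipy's cKDTree; the similarity threshold (0.95) and variable names are illustrative, not prescriptive:

import numpy as np
from scipy.spatial import cKDTree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

dataset = ["example.com", "examplecom", "googl.com", "google.com"]  # ... ~1M domains

# Character-level TF-IDF, as in the question.  TfidfVectorizer already
# L2-normalises rows by default; normalize() just makes that explicit,
# since the KDTree trick needs unit-length vectors.
tfidf_matrix = TfidfVectorizer(analyzer="char").fit_transform(dataset)
X = normalize(tfidf_matrix, norm="l2")

# cKDTree needs a dense array.  With analyzer="char" the vocabulary is only
# a few dozen characters, so even 1M rows stay a few hundred MB.
X_dense = X.toarray()
tree = cKDTree(X_dense)

# For unit vectors, ||a - b||^2 = 2 * (1 - cos(a, b)), so a cosine-similarity
# threshold translates directly into a Euclidean max_distance.
min_cosine = 0.95          # illustrative threshold -- tune for your data
max_distance = np.sqrt(2 * (1 - min_cosine))

# Only pairs within max_distance are stored; everything farther apart is
# simply absent, which is what keeps memory bounded.
D = tree.sparse_distance_matrix(tree, max_distance, output_type="coo_matrix")

# Convert each stored Euclidean distance back to a cosine similarity.
for i, j, d in zip(D.row, D.col, D.data):
    if i < j:  # skip self-pairs and mirrored duplicates
        print(dataset[i], dataset[j], 1 - d ** 2 / 2)

The memory saving comes from never materialising the full n x n similarity matrix: only pairs within the chosen distance are ever stored, and a higher min_cosine (smaller max_distance) means fewer stored pairs.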
Upvotes: 1