Rakesh Chaudhari

Reputation: 3502

How to find cosine similarity for a very large array

I have a very large dataset of domain names, approximately 1 million entries.

I want to find domains that are duplicated in the dataset due to misspellings.

So I have been using cosine similarity to find similar documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["example.com", "examplecom", "googl.com", "google.com", ...]
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset)
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)  # dense n x n matrix

The above example works fine for a small dataset, but for a large dataset it throws an out-of-memory error.

System Configuration:

1) 8 GB RAM

2) 64-bit system with 64-bit Python installed

3) i3-3210 processor

How to find cosine similarity for a large dataset?

Upvotes: 0

Views: 1544

Answers (1)

Daniel F

Reputation: 14399

You can use a KDTree built on L2-normalized inputs to get cosine distances, as per the answer here. Then it's just a case of setting the maximum distance you want returned (so you don't keep all the larger distances, which are most of the memory you are using) and getting back a sparse distance matrix, for example a coo_matrix from scipy.spatial.cKDTree.sparse_distance_matrix.

Unfortunately I don't have my interpreter handy to code up a full answer right now, but that's the gist of it.

Make sure whatever model you're fitting from that distance matrix can accept sparse inputs, though.
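A rough, untested sketch of that approach (assuming scikit-learn's TfidfVectorizer and SciPy's cKDTree; the max_distance value of 0.5 is a made-up placeholder you would tune):

from scipy.spatial import cKDTree
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ["example.com", "examplecom", "googl.com", "google.com"]

# Character-level TF-IDF rows are L2-normalized by default, so for any
# two rows the Euclidean distance d and cosine similarity s satisfy
# d**2 == 2 * (1 - s).
tfidf_matrix = TfidfVectorizer(analyzer="char").fit_transform(dataset)

# cKDTree needs a dense array; with analyzer="char" the vocabulary is
# only a few dozen characters, so even 1 million rows stay manageable.
vectors = tfidf_matrix.toarray()
tree = cKDTree(vectors)

# Only pairs closer than max_distance are ever stored; everything
# farther apart is silently dropped, which is what keeps memory down.
max_distance = 0.5  # d = 0.5 corresponds to cosine similarity >= 0.875
sparse_distances = tree.sparse_distance_matrix(
    tree, max_distance, output_type="coo_matrix"
)

# Convert the stored Euclidean distances back to cosine similarities.
cosine_sims = 1 - sparse_distances.data ** 2 / 2

The trade-off is in max_distance: too loose and the sparse matrix grows back toward the dense one; too tight and genuinely misspelled duplicates fall outside the cutoff.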

Upvotes: 1
