Reputation: 177
I am working on a function that calculates the cosine similarity of each record in one dataset (M x K) against the records in another dataset (N x K), where N is much smaller than M.
The code below does the job well when I test it on a tiny dataset (the 'iris' dataset, for example). I am worried it might struggle on bigger datasets (100K records and 100+ variables).
I know for loops are not advisable in such scenarios, and I have two of them here. I am wondering if anyone can suggest ways of improving this code.
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def similarity_calculation(seed_data, pool_data):
    # Collect one row of similarity scores per pool record
    rows = []
    for indexi, rowi in pool_data.iterrows():
        # Create an array to store the similarity scores for this pool record
        similarity_score_array = []
        for indexj, rowj in seed_data.iterrows():
            # Fetch a single record from the pool dataset
            pool = rowi.values.reshape(1, -1)
            # Fetch a single record from the seed dataset
            seed = rowj.values.reshape(1, -1)
            # Measure the similarity score between the two records
            similarity_score = cosine_similarity(pool, seed)[0][0]
            similarity_score_array.append(similarity_score)
        # Append the similarity scores as a new row of the similarity matrix
        rows.append(pd.Series(similarity_score_array))
    # DataFrame.append was removed in pandas 2.0, so build the frame in one go
    similarity_matrix = pd.DataFrame(rows).reset_index(drop=True)
    return similarity_matrix
Edit 1: For sample data, the iris dataset is used as follows:
iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
My new, more compact code (with a single for loop) is as follows:
def similarity_calculation_compact(seed_data, pool_data):
    pool_array = pool_data.values
    seed_array = seed_data.values
    scores = []
    for i in range(pool_array.shape[0]):
        # Mean similarity of one pool record against all seed records
        scores.append(np.mean(cosine_similarity(pool_array[None, i, :], seed_array)))
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
    return final_data
I was expecting identical results, since both functions are supposed to fetch the records from the pool data that are most similar (in terms of average cosine similarity) to the seed data.
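For what it's worth, a quick sanity check on small random stand-in arrays (names hypothetical, not from the original post) confirms that the per-row loop and a single vectorized cosine_similarity call produce the same mean scores:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
pool = rng.random((20, 4))   # stand-in for pool_set
seed = rng.random((5, 4))    # stand-in for seed_set

# Looped version: one pool row at a time
looped = np.array([np.mean(cosine_similarity(pool[None, i, :], seed))
                   for i in range(pool.shape[0])])

# Vectorized version: one call for the whole (20, 5) matrix, then row means
vectorized = cosine_similarity(pool, seed).mean(axis=1)

assert np.allclose(looped, vectorized)
```

So any difference between the two functions must come from what they return (a full score matrix versus sorted mean scores), not from the similarity computation itself.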
Upvotes: 1
Views: 115
Reputation: 9806
There is no need for the for loops: cosine_similarity takes as input two arrays of shapes (n_samples_X, n_features) and (n_samples_Y, n_features), and returns an array of shape (n_samples_X, n_samples_Y) containing the cosine similarity between each pair of rows of the two input arrays.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
iris_data = pd.read_csv("iris.csv", header=0)
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
np.mean(cosine_similarity(pool_set, seed_set), axis=1)
Result (after sorting):
array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
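To reproduce the full output of similarity_calculation_compact (mean scores attached to the pool records and sorted descending), the vectorized call can be dropped in directly. A sketch (the function name is my own, not from the original post):

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def similarity_calculation_vectorized(seed_data, pool_data):
    # One library call computes the full (n_pool, n_seed) similarity matrix;
    # averaging over axis=1 gives each pool record's mean similarity to the seeds
    scores = cosine_similarity(pool_data, seed_data).mean(axis=1)
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    return final_data.sort_values(by='mean_similarity_score', ascending=False)
```

This matches the compact function's result while replacing the remaining Python loop with a single call, which scales much better for 100K x 100+ data.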
Upvotes: 1