Federico Marchese

Reputation: 219

Fastest way to perform cosine similarity for 10 million pairs of 1x20 vectors

I have a pandas DataFrame with 2 columns, each containing 2.7 million rows of normalized vectors of length 20.

I want to take the cosine similarity of column1-row1 vs column2-row1, column1-row2 vs column2-row2, and so forth down to row 2.7 million.

I have tried looping, but this is extremely slow. What is the fastest way to do this?

Here is what I'm using now:

from scipy import spatial

for index, row in tempdf.iterrows():
    x = 1 - spatial.distance.cosine(tempdf['unit_vector'][index],
                                    tempdf['ave_unit_vector'][index])
    print(index, x)

data:

tempdf['unit_vector']
Out[185]: 
0          [0.7071067811865475, 0.7071067811865475, 0.0, ...
1          [0.634997029655247, 0.634997029655247, 0.43995...
2          [0.5233710392524532, 0.5233710392524532, 0.552...
3          [0.4792468085399227, 0.4792468085399227, 0.505...
4          [0.4937468195427678, 0.4937468195427678, 0.492...
5          [0.49444897739151283, 0.49444897739151283, 0.5...
6          [0.49548793862403173, 0.49548793862403173, 0.4...
7          [0.5027211862475275, 0.5027211862475275, 0.495...
8          [0.5136216906905179, 0.5136216906905179, 0.489...
9          [0.5035958124287837, 0.5035958124287837, 0.508...
10         [0.5037995208120967, 0.5037995208120967, 0.493...


tempdf['ave_unit_vector']
Out[186]: 
0          [0.5024525269125278, 0.5024525269125278, 0.494...
1          [0.5010905514059507, 0.5010905514059507, 0.499...
2          [0.4993456468410199, 0.4993456468410199, 0.501...
3          [0.5005492367626839, 0.5005492367626839, 0.498...
4          [0.4999384715200533, 0.4999384715200533, 0.501...
5          [0.49836832120891517, 0.49836832120891517, 0.5...
6          [0.49842376222388335, 0.49842376222388335, 0.5...
7          [0.4984869391887457, 0.4984869391887457, 0.500...
8          [0.4990867844970344, 0.4990867844970344, 0.499...
9          [0.49977780370532715, 0.49977780370532715, 0.4...
10         [0.5003161478128204, 0.5003161478128204, 0.499...

This isn't the same dataset, but the following will create a usable df with vector columns 'B' and 'C':

import pandas as pd

df = pd.DataFrame(list(range(0, 1000)), columns=['A'])

# Build five shifted copies of 'A' to serve as vector components.
for i in range(0, 5):
    df['New_{}'.format(i)] = df['A'].shift(i).tolist()

cols = len(df.columns)
start_col = cols - 6

# Pack the six numeric columns into one list-valued column.
df['B'] = df.iloc[:, start_col:cols].values.tolist()
# Scale each vector elementwise; note that df['B'] * 2 would concatenate
# each list with itself rather than double its values.
df['C'] = df['B'].apply(lambda v: [2 * x for x in v])
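
To sanity-check the toy frame, the same loop as above can be run over 'B' and 'C' (a minimal sketch, assuming SciPy is available; rows 0-3 contain NaNs from the shifts and are skipped):

from scipy import spatial

# 'C' is a scalar multiple of 'B', so every similarity should print as 1.0.
for index in range(4, len(df)):
    x = 1 - spatial.distance.cosine(df['B'][index], df['C'][index])
    print(index, x)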

Upvotes: 3

Views: 2125

Answers (1)

Federico Marchese

Reputation: 219

This is the fastest way I have tried. It brought the calculation down from over 30 minutes with the loop to about 5 seconds:

import numpy as np

# Multiply paired vectors elementwise, then sum each product row (a dot product).
tempdf['vector_mult'] = np.multiply(tempdf['unit_vector'], tempdf['ave_unit_vector'])
tempdf['cosinesim'] = tempdf['vector_mult'].apply(lambda x: sum(x))

This works because my vectors are already unit vectors: cosine similarity is u · v / (‖u‖ ‖v‖), and with ‖u‖ = ‖v‖ = 1 the dot product alone gives the similarity.

The first line multiplies the vectors in the two columns elementwise, row by row. The second sums each product vector, again row by row. The challenge was that the pre-built functions I tried would not work row by row; they wanted to aggregate the vectors in each column and then compute a single result.
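
For reference, the same row-wise multiply-then-sum can be done in a single vectorized step by first stacking the columns into 2-D NumPy arrays. This is a sketch of a variant, not code from the original answer, and it assumes every cell holds an equal-length numeric array:

import numpy as np

# Stack each column of length-20 vectors into an (n_rows, 20) array.
u = np.vstack(tempdf['unit_vector'].to_numpy())
v = np.vstack(tempdf['ave_unit_vector'].to_numpy())

# Row-wise dot product: elementwise multiply, then sum along each row.
tempdf['cosinesim'] = np.einsum('ij,ij->i', u, v)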

Upvotes: 3
