How to compute dot product between each row of two pandas columns with sparse vectors

Question

I have a pandas DataFrame with two columns, each of which holds a SciPy sparse vector in every row. The vectors are rows taken from CSR matrices, so each one is actually a matrix of shape 1x8500.

I need to add a third column whose value in each row is the dot product of the two vectors in that same row.

I know how to do this with apply / map on each row, but that is too slow on datasets with millions of rows. Is there a much faster way to do this for the entire DataFrame?

Besides the dot product, I will also need cosine similarity, but as far as I understand that can be derived from dot products.
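
For reference, cosine similarity is the dot product divided by the product of the two Euclidean norms, and each squared norm is the dot product of a vector with itself. A minimal sketch for a single pair of 1 x n sparse rows follows; the function name cosine_from_dots is hypothetical, and it assumes neither vector is all zeros:

```python
import numpy as np
from scipy.sparse import csr_matrix


def cosine_from_dots(u, v):
    """Cosine similarity of two 1 x n sparse rows, built from three dot products.

    Hypothetical helper for illustration; assumes u and v are non-zero.
    """
    uv = u.multiply(v).sum()  # dot(u, v)
    uu = u.multiply(u).sum()  # dot(u, u) = squared norm of u
    vv = v.multiply(v).sum()  # dot(v, v) = squared norm of v
    return uv / np.sqrt(uu * vv)


u = csr_matrix(np.array([[1.0, 0.0, 2.0]]))
v = csr_matrix(np.array([[2.0, 0.0, 4.0]]))
similarity = cosine_from_dots(u, v)
```

Here v is a positive multiple of u, so the similarity comes out as 1.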

Update: I cannot share the actual data here, but here's a toy example (note that I only have the resulting dataframe for now):

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i)*2 for i in range(3)]})

I know I could do something like this to calculate the dot product:

df['Col_3'] = df.apply(lambda row: row['Col_1'].dot(row['Col_2'].T)[0, 0],
                       axis=1)

But is there a much more efficient way to calculate that Col_3?
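
One possible vectorized alternative, sketched here and not benchmarked: stack each column back into a single CSR matrix with scipy.sparse.vstack, multiply the two matrices elementwise, and sum along each row. The helper name rowwise_dot is hypothetical, and the sketch assumes every cell holds a 1 x n sparse row of the same width:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, vstack


def rowwise_dot(left, right):
    """Row-by-row dot products of two columns of 1 x n sparse rows.

    `rowwise_dot` is a hypothetical helper name, not a pandas or SciPy API.
    """
    # Rebuild one sparse matrix per column (one matrix row per DataFrame row).
    a = vstack(left.tolist(), format="csr")
    b = vstack(right.tolist(), format="csr")
    # The elementwise product stays sparse; summing each row gives its dot product.
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


# Toy data from the question
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i) * 2 for i in range(3)]})

df['Col_3'] = rowwise_dot(df['Col_1'], df['Col_2'])
```

Stacking still walks every row once, but the multiplication and the sum then run inside SciPy instead of in a Python-level lambda. If the original CSR matrices are still available, using them directly would avoid the stacking step altogether.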
