Sergey Zakharov

Reputation: 1613

How to compute dot product between each row of two pandas columns with sparse vectors

I have a Pandas dataframe with two columns, each of which contains a SciPy sparse vector in every row. Those vectors are rows taken from csr matrices (so they are actually matrices of shape 1x8500).

I need to create another column in which each row holds the dot product of the two vectors in the same row of the first two columns.

I know how to do this with apply / map on each row, but that takes too long on datasets with millions of rows. Is there a much faster way to do this on the entire dataframe?

Apart from the dot product, I will also need to compute cosine similarity, but as far as I understand that can be derived from dot products.

Update: I cannot share the actual data here, but here's a toy example (note that I only have the resulting dataframe for now):

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i)*2 for i in range(3)]})

I know I could do something like this to calculate the dot product:

df['Col_3'] = df.apply(lambda row: np.dot(row['Col_1'],
                       row['Col_2'].transpose()).toarray()[0][0], axis=1)

But is there a much more efficient way to calculate that Col_3?

Upvotes: 1

Views: 2538

Answers (1)

hpaulj

Reputation: 231550

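With your example, work on the full matrices instead of row by row: multiply them elementwise and sum each row. Displaying col3 gives the three dot products: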

matA = mat
matB = mat*2
col3 = (matA.multiply(matB)).sum(axis=1)

[[ 10]
 [ 18]
 [154]]
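
The cosine similarity mentioned in the question can be derived from the same elementwise-multiply-and-sum pattern. The following is a minimal sketch on the toy data, not a tuned implementation; the names dots, normA, normB, denom and cosine are illustrative, and returning 0 for all-zero rows is an assumption you may want to change.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrices from the question
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matA = csr_matrix((data, (row, col)), shape=(3, 3))
matB = matA * 2

# Row-wise dot products, flattened to a 1-D array
dots = np.asarray(matA.multiply(matB).sum(axis=1)).ravel()

# Row-wise Euclidean norms, computed the same way
normA = np.sqrt(np.asarray(matA.multiply(matA).sum(axis=1)).ravel())
normB = np.sqrt(np.asarray(matB.multiply(matB).sum(axis=1)).ravel())

# Cosine similarity; rows with a zero norm get 0.0 (an assumption)
denom = normA * normB
cosine = np.divide(dots, denom, out=np.zeros(len(dots)), where=denom > 0)
```

Because matB is just matA scaled by 2 here, every row should come out with a cosine similarity of (approximately) 1.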

for i in range(3):
    print(df['Col_1'][i].A, df['Col_2'][i].A)
[[1 0 2]] [[2 0 4]]
[[0 0 3]] [[0 0 6]]
[[4 5 6]] [[ 8 10 12]]

df['Col_1'] has dtype object, and each element is a csr matrix, the result of mat.getrow(i). The dataframe display is a little messy, with embedded tabs and newlines; the dense equivalent produced with .A (shorthand for .toarray()) is prettier. The shape is consistent across rows, but the number of nonzero terms varies.
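
Since every element has the same 1xN shape, the rows can be stacked back into whole matrices when only the dataframe is available. The sketch below assumes scipy.sparse.vstack is acceptable for your data size; the toy dataframe is rebuilt as in the question, and Col_3 is the column name the question uses.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, vstack

# Toy dataframe from the question: one 1x3 csr matrix per cell
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i) * 2 for i in range(3)]})

# Stack the per-row matrices of each column into one sparse matrix
matA = vstack(df['Col_1'].tolist(), format='csr')
matB = vstack(df['Col_2'].tolist(), format='csr')

# Elementwise product, then row sums, gives one dot product per row
df['Col_3'] = np.asarray(matA.multiply(matB).sum(axis=1)).ravel()
```

The stacking is a one-time cost; after that the multiply and sum run on the whole matrices rather than once per dataframe row.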

Upvotes: 0
