Reputation: 1613
I have a Pandas dataframe with two columns each of which contains a SciPy sparse vector in every row. Those vectors are rows from csr matrices (so they are actually matrices of shape 1x8500).
I need to create another column which should contain in each of its rows a dot product between the vectors from the first two columns of the same row.
I know how to do this with apply / map on each row, but it takes far too long when I'm working on datasets with millions of rows. Is there a much faster way to do this on the entire dataframe?
Apart from the dot product, I will also need to compute cosine similarity, but as far as I understand that can be derived from dot products.
Update: I cannot share the actual data here, but here's a toy example (note that I only have the resulting dataframe for now):
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i)*2 for i in range(3)]})
I know I could do something like this to calculate the dot product:
df['Col_3'] = df.apply(lambda row: np.dot(row['Col_1'],
                       row['Col_2'].transpose()).toarray()[0][0], axis=1)
But is there a much more efficient way to calculate that Col_3?
Upvotes: 1
Views: 2538
Reputation: 231550
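With your example, working on the full matrices rather than on the individual rows:
matA = mat
matB = mat*2
# elementwise product, then sum along each row = row-wise dot product
col3 = (matA.multiply(matB)).sum(axis=1)
print(col3)
[[ 10]
 [ 18]
 [154]]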
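The cosine similarity mentioned in the question can be built from the same row-wise sums. The following is only a sketch under the toy data above: the names `dots`, `normA`, `normB` and `cos` are illustrative, and rows with a zero norm are assigned a similarity of 0 here, which is an assumption you may want to handle differently.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrices from the question.
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
matA = mat
matB = mat * 2

def row_sums(m):
    # .sum(axis=1) on a sparse matrix returns a 2-D result; flatten it to 1-D.
    return np.asarray(m.sum(axis=1)).ravel()

# Row-wise dot products and Euclidean norms, all without a Python loop.
dots = row_sums(matA.multiply(matB))
normA = np.sqrt(row_sums(matA.multiply(matA)))
normB = np.sqrt(row_sums(matB.multiply(matB)))

# Assumption: rows where either norm is zero get similarity 0.
denom = normA * normB
cos = np.divide(dots, denom, out=np.zeros(len(dots)), where=denom > 0)
print(cos)
```

Since `matB` is just `matA` scaled by 2, every row should come out with a similarity of 1 (up to floating-point rounding).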
for i in range(3):
    print(df['Col_1'][i].A, df['Col_2'][i].A)
[[1 0 2]] [[2 0 4]]
[[0 0 3]] [[0 0 6]]
[[4 5 6]] [[ 8 10 12]]
The df['Col_1'] dtype is object, and each element is a csr matrix, the result of mat.getrow(i). The display is a little messy, with embedded tabs and newlines. The dense equivalent produced with .A (shorthand for .toarray()) is prettier. The shape is consistent, but the number of nonzero terms varies.
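If only the dataframe is available, and because every element has the same shape, one option is to stack each column back into a single sparse matrix and then apply the same multiply-and-sum step. This is a minimal sketch, assuming `scipy.sparse.vstack` is acceptable for your data size; the variable names are illustrative, and it has not been timed on millions of rows.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, vstack

# Rebuild the toy dataframe from the question.
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
mat = csr_matrix((data, (row, col)), shape=(3, 3))
df = pd.DataFrame({'Col_1': [mat.getrow(i) for i in range(3)],
                   'Col_2': [mat.getrow(i) * 2 for i in range(3)]})

# Stack the 1 x n rows of each column into one (n_rows x n) CSR matrix.
matA = vstack(df['Col_1'].tolist(), format='csr')
matB = vstack(df['Col_2'].tolist(), format='csr')

# Elementwise product, then sum along each row = row-wise dot product.
df['Col_3'] = np.asarray(matA.multiply(matB).sum(axis=1)).ravel()
print(df['Col_3'].tolist())  # [10, 18, 154]
```

The stacking is done once per column, so the per-row Python overhead of apply is replaced by two sparse matrix operations.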
Upvotes: 0