Reputation: 4924
I am trying to avoid using loops in the following code because it is slow. I am starting with a list of labels and a list of metrics, both of the same length (a few million entries). I then want to make a symmetric NxN matrix where N is the number of unique label values (about 100). The matrix contains a comparison of the metrics. Specifically, the metrics are a list of lists, and for each pair of entries I want to count the number of matching elements between the two sublists and add that count to the output matrix cell for the corresponding pair of labels. The current code is:
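import numpy as np

# Assumes each metrics[i] is a NumPy array, so == compares element by element.
matrix = {}
for i, value_i in enumerate(labels):
    for j, value_j in enumerate(labels):
        if i >= j:
            # Add the number of positions where the two metric rows agree.
            matrix[(value_i, value_j)] = (matrix.get((value_i, value_j), 0)
                                          + np.count_nonzero(metrics[i] == metrics[j]))
            if i != j:
                # Mirror the value to keep the matrix symmetric.
                matrix[(value_j, value_i)] = matrix[(value_i, value_j)]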
I want to do something like a list comprehension, but I also want a dictionary because I update it so regularly. For context, I have cut this out of more elaborate code here.
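To make the target concrete, here is a minimal sketch of the same accumulation on made-up data. The function name `reference_counts` and the toy `labels` and `metrics` are invented for illustration, and it assumes `metrics` is a 2-D NumPy array so that `==` compares rows element by element.

```python
import numpy as np

def reference_counts(labels, metrics):
    """Accumulate per-label-pair match counts with the plain double loop."""
    matrix = {}
    for i, value_i in enumerate(labels):
        for j in range(i + 1):  # visits the same pairs as the `i >= j` test
            value_j = labels[j]
            key = (value_i, value_j)
            matrix[key] = matrix.get(key, 0) + np.count_nonzero(metrics[i] == metrics[j])
            if i != j:
                # Mirror the value to keep the result symmetric.
                matrix[(value_j, value_i)] = matrix[key]
    return matrix

# Made-up toy data: three rows, two distinct labels.
labels = ["a", "b", "a"]
metrics = np.array([[1, 2, 3],
                    [1, 0, 3],
                    [1, 2, 0]])

result = reference_counts(labels, metrics)
# ("a", "a") -> 8, ("a", "b") and ("b", "a") -> 3, ("b", "b") -> 3
print(result)
```

Each row matches itself in all three positions, which is why the diagonal entries include a contribution of 3 per row.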
-------Update--------
Awarding the answer to @piRSquared for suggesting the use of numba. The gain comes from using this package, not from using an array instead of a dict. For comparison, the following is 1.29 times slower.
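f, u = pd.factorize(labels)  # f: integer code per row, u: unique labels
mx = f.max() + 1
matrix = np.zeros((mx, mx), np.int64)
for i, a in enumerate(f):
    for j, b in enumerate(f):
        if i >= j:
            # i, j index rows of metrics; a, b are the label codes that index the matrix.
            matrix[a, b] += np.count_nonzero(metrics[i] == metrics[j])
            if i != j:
                matrix[b, a] = matrix[a, b]
df = pd.DataFrame(matrix, u, u)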
Upvotes: 1
Views: 1696
Reputation: 294478
I'm still messing with this. This is still a loop, but I'm using numba to speed it up. I will eventually flesh this post out with more information. However, I wanted to give you something to work with for now.
I have other ideas to speed things up as well.
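from string import ascii_uppercase
import numpy as np
import pandas as pd
from numba import njit

@njit
def fill(f, metrics):
    # f holds one integer label code per row; metrics must be a 2-D NumPy array.
    mx = f.max() + 1
    matrix = np.zeros((mx, mx), np.int64)
    n = f.shape[0]
    for i in range(n):
        for j in range(n):
            if i >= j:
                # i, j select rows of metrics; a, b are the label codes that index the matrix.
                a = f[i]
                b = f[j]
                row_i = metrics[i]
                row_j = metrics[j]
                matrix[a, b] = matrix[a, b] + (row_i == row_j).sum()
                if i != j:
                    # Mirror the value to keep the matrix symmetric.
                    matrix[b, a] = matrix[a, b]
    return matrix

def fill_from_labels(labels, metrics):
    # Convert labels to integer codes, fill the matrix, then label rows and columns.
    f, u = pd.factorize(labels)
    matrix = fill(f, metrics)
    return pd.DataFrame(matrix, u, u)

df = fill_from_labels(labels, metrics)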
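One loop-free possibility, offered only as a sketch and not as part of the numba approach: instead of comparing every pair of rows, count how often each value occurs in each column for each label, then multiply the counts. The function name `label_match_matrix` is made up here. The sketch assumes `metrics` is a rectangular array with no NaN values and a modest number of distinct values (the count array has shape labels × columns × distinct values), and it uses `np.unique`, which returns the labels in sorted order rather than the order of first appearance that `pd.factorize` gives.

```python
import numpy as np

def label_match_matrix(labels, metrics):
    """Pairwise match counts per label pair without a row-by-row double loop (sketch)."""
    metrics = np.asarray(metrics)
    n_rows, n_cols = metrics.shape

    # Integer codes for the labels and for the metric values.
    uniques, f = np.unique(np.asarray(labels), return_inverse=True)
    f = f.ravel()
    values, codes = np.unique(metrics, return_inverse=True)
    codes = codes.reshape(n_rows, n_cols)

    # counts[a, k, v] = number of rows with label a whose column k holds value v.
    counts = np.zeros((len(uniques), n_cols, len(values)), dtype=np.int64)
    cols = np.broadcast_to(np.arange(n_cols), (n_rows, n_cols))
    np.add.at(counts, (f[:, None], cols, codes), 1)

    # Summing products of counts over (column, value) counts every ordered row pair.
    flat = counts.reshape(len(uniques), -1)
    matrix = flat @ flat.T

    # Within a single label, each unordered pair was counted twice and each
    # row's match with itself (n_cols matches) once; the loop counts both once.
    sizes = np.bincount(f, minlength=len(uniques))
    d = np.arange(len(uniques))
    matrix[d, d] = (matrix[d, d] + sizes * n_cols) // 2
    return uniques, matrix

# Made-up toy data for illustration.
toy_labels = ["a", "b", "a"]
toy_metrics = [[1, 2, 3], [1, 0, 3], [1, 2, 0]]
toy_uniques, toy_matrix = label_match_matrix(toy_labels, toy_metrics)
# toy_uniques -> ['a' 'b'], toy_matrix -> [[8, 3], [3, 3]]
```

The work here grows with the number of rows times the number of columns, plus one small matrix product, rather than with the square of the number of rows. Whether that is faster in practice depends on how many distinct metric values there are, so it is worth timing against the numba version on real data.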
Upvotes: 1