Build matrix without loops?

Question

I am trying to avoid loops in the following code because it is slow. I start with a list of labels and a list of metrics, each a few million entries long. From these I want to build a symmetric NxN matrix, where N is the number of unique label values (about 100). Each cell holds a comparison of the metrics: `metrics` is a list of lists, and for each pair of entries I want to count the number of matching elements in the two sublists and add that count to the cell for the pair's labels. The current code is:

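import numpy as np

matrix = {}
for i, value_i in enumerate(labels):
    for j, value_j in enumerate(labels):
        if i >= j:
            # Count positions where the two sublists agree
            # (elementwise only if the sublists are NumPy arrays).
            matrix[(value_i, value_j)] = (
                matrix.get((value_i, value_j), 0)
                + np.count_nonzero(metrics[i] == metrics[j])
            )
            if i != j:
                # Mirror the value to keep the matrix symmetric.
                matrix[(value_j, value_i)] = matrix[(value_i, value_j)]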

I would like to use something like a list comprehension, but I also want a dictionary because I update the entries so frequently. For context, I have cut this out of more elaborate code.
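To make the inputs concrete, here is a tiny made-up example of the same computation. The data values and the function name `pairwise_match_counts` are illustrative assumptions, and `metrics` is assumed to be a NumPy array so that `==` compares element by element.

```python
import numpy as np


def pairwise_match_counts(labels, metrics):
    """Reference loop: accumulate elementwise matches per label pair."""
    matrix = {}
    for i, value_i in enumerate(labels):
        for j in range(i + 1):  # only pairs with i >= j
            value_j = labels[j]
            key = (value_i, value_j)
            matches = np.count_nonzero(metrics[i] == metrics[j])
            matrix[key] = matrix.get(key, 0) + matches
            if i != j:
                # Mirror the value to keep the result symmetric.
                matrix[(value_j, value_i)] = matrix[key]
    return matrix


# Made-up data: four rows, three distinct labels.
labels = ["a", "b", "a", "c"]
metrics = np.array([
    [1, 2, 3],
    [1, 0, 3],
    [1, 2, 0],
    [9, 9, 9],
])

counts = pairwise_match_counts(labels, metrics)
print(counts[("a", "a")])  # 8
print(counts[("a", "b")])  # 3
print(counts[("b", "a")])  # 3
```

Rows 0 and 2 both carry label "a", so the ("a", "a") cell sums three pairs: each row against itself (3 matches each) and row 2 against row 0 (2 matches).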

-------Update--------

I am awarding the answer to @piRSquared for suggesting numba. The gain comes from using that package, not from using an array instead of a dict. For comparison, the following array-based version without numba is 1.29 times slower.

import numpy as np
import pandas as pd

f, u = pd.factorize(labels)
mx = f.max() + 1
matrix = np.zeros((mx, mx), np.int64)
for i in f:
    for j in f:
        if i >= j:
            matrix[i, j] += np.count_nonzero(metrics[i] == metrics[j])
            if i != j:
                matrix[j, i] = matrix[i, j]
df = pd.DataFrame(matrix, u, u)

piRSquared · Accepted Answer

I'm still messing with this. This is still a loop but I'm using numba to speed it up. I will eventually flesh this post out with more information. However, I wanted to give you something to work with for now.

I have other ideas to speed things up as well.

import numpy as np
import pandas as pd
from numba import njit

@njit
def fill(f, metrics):
    # Assumes `metrics` is a 2-D NumPy array, which numba can compile against.
    mx = f.max() + 1
    matrix = np.zeros((mx, mx), np.int64)
    for i in f:
        for j in f:
            if i >= j:
                row_i = metrics[i]
                row_j = metrics[j]
                # Add the number of positions where the two rows agree.
                matrix[i, j] += (row_i == row_j).sum()
                if i != j:
                    # Mirror the value so the matrix stays symmetric.
                    matrix[j, i] = matrix[i, j]
    return matrix


def fill_from_labels(labels, metrics):
    f, u = pd.factorize(labels)
    matrix = fill(f, metrics)
    return pd.DataFrame(matrix, u, u)

df = fill_from_labels(labels, metrics)
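One thing worth checking: `fill` above loops over the factorized codes in `f` and uses those codes to index `metrics`, whereas the loop in the question pairs rows by position. If position-wise pairing is what you want, a variant might look like the sketch below. The names `fill_by_position` and `fill_from_labels_by_position` are made up for illustration, `metrics` is assumed to be convertible to a 2-D NumPy array, and the `try`/`except` fallback is only there so the sketch still runs (slowly) without numba.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: fall back to plain Python without numba.
    def njit(func):
        return func


@njit
def fill_by_position(codes, metrics, n_labels):
    # codes[i] is the integer label code of row i; metrics is a 2-D array.
    matrix = np.zeros((n_labels, n_labels), np.int64)
    for i in range(codes.shape[0]):
        for j in range(i + 1):  # only pairs of row positions with i >= j
            a = codes[i]
            b = codes[j]
            # Add the number of positions where row i and row j agree.
            matrix[a, b] += (metrics[i] == metrics[j]).sum()
            if i != j:
                # Mirror the value so the matrix stays symmetric.
                matrix[b, a] = matrix[a, b]
    return matrix


def fill_from_labels_by_position(labels, metrics):
    codes, uniques = pd.factorize(labels)
    matrix = fill_by_position(codes, np.asarray(metrics), len(uniques))
    return pd.DataFrame(matrix, uniques, uniques)


# Made-up data: four rows, three distinct labels.
labels = ["a", "b", "a", "c"]
metrics = np.array([
    [1, 2, 3],
    [1, 0, 3],
    [1, 2, 0],
    [9, 9, 9],
])

df = fill_from_labels_by_position(labels, metrics)
print(df)
```

This is still quadratic in the number of rows, so it only illustrates the indexing; it is not a claim about how it performs on a few million entries.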
