Malte

Reputation: 101

Can I avoid looping through a vector to apply the same operation to every element (Python)?

I have a vector and have to do the same calculation for every element of it. It's a similarity measure for categorical data that needs the frequency of a given value. In this case I want to calculate the distance between two points a and b, which are rows of x:

import numpy as np

x = np.array([[1, 1, 1],
              [2, 1, 1]])
a = x[0]
b = x[1]

pairwise_dist = []
for i in range(len(a)):
    freq_a = sum(x[:, i] == a[i])
    freq_b = sum(x[:, i] == b[i])

    match = 1
    missmatch = 1/(1 + np.log(freq_a)*np.log(freq_b))
            
    pairwise_dist.append((a[i] == b[i]) * match + (a[i] != b[i]) * missmatch)

dist = sum(pairwise_dist)/len(a)

It's always the same operation, and each step only touches the i-th element of the vectors. Shouldn't it be possible to vectorize this?

Upvotes: 0

Views: 134

Answers (1)

Blckknght

Reputation: 104722

NumPy's broadcasting can handle pretty much all of what your loop does. The only special handling you need is to pass an axis argument to numpy.sum when computing freq_a and freq_b: summing over all the axes at once would give you a scalar, not a vector of per-column counts.

import numpy as np

x = np.array([[1, 1, 1],
              [2, 1, 1]])
a = x[0]
b = x[1]

freq_a = np.sum(x == a, axis=0)
freq_b = np.sum(x == b, axis=0)

match = 1
missmatch = 1/(1 + np.log(freq_a)*np.log(freq_b))

pairwise_dist = (a == b) * match + (a != b) * missmatch

dist = np.sum(pairwise_dist)/len(a)
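
To make the role of the axis argument concrete, here is a small sketch using the same x as above. The names matches, total and per_column are only for illustration:

```python
import numpy as np

x = np.array([[1, 1, 1],
              [2, 1, 1]])
a = x[0]

# Broadcasting compares a (shape (3,)) against every row of x (shape (2, 3)).
matches = (x == a)

# Without axis, everything collapses to a single scalar.
total = np.sum(matches)

# With axis=0, the rows are summed away and one count per column remains.
per_column = np.sum(matches, axis=0)

print(total)       # 5
print(per_column)  # [1 2 2]
```

per_column is exactly the freq_a vector used in the code above.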

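Note that while this is an accurate translation of your existing code, I'm not sure it does anything useful for the example input. When x contains only the two rows being compared, it always computes a dist of 1.0: wherever a and b differ, each value occurs exactly once in its column, so both logarithms are 0 and missmatch is 1. The original code does the same, so I guess it's bug-compatible!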
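
If you want to convince yourself that the two versions really agree, a rough check along these lines should do. The function names dist_loop, dist_vectorized and all_pairs_agree are made up for this sketch, and the random test data is just an assumption about what your categorical data might look like:

```python
import numpy as np

def dist_loop(x, a, b):
    """Element-by-element version, following the loop in the question."""
    pairwise = []
    for i in range(len(a)):
        freq_a = np.sum(x[:, i] == a[i])
        freq_b = np.sum(x[:, i] == b[i])
        missmatch = 1 / (1 + np.log(freq_a) * np.log(freq_b))
        pairwise.append((a[i] == b[i]) * 1 + (a[i] != b[i]) * missmatch)
    return sum(pairwise) / len(a)

def dist_vectorized(x, a, b):
    """Broadcast version: per-column frequencies via axis=0."""
    freq_a = np.sum(x == a, axis=0)
    freq_b = np.sum(x == b, axis=0)
    missmatch = 1 / (1 + np.log(freq_a) * np.log(freq_b))
    pairwise = (a == b) * 1 + (a != b) * missmatch
    return np.sum(pairwise) / len(a)

def all_pairs_agree(x):
    """Compare both versions for every pair of rows in x."""
    for a in x:
        for b in x:
            if not np.isclose(dist_loop(x, a, b), dist_vectorized(x, a, b)):
                return False
    return True

# Hypothetical data: 20 points, 5 categorical features with values 1..3.
rng = np.random.default_rng(0)
x = rng.integers(1, 4, size=(20, 5))
print(all_pairs_agree(x))
```

Both functions assume a and b are rows of x, so every frequency is at least 1 and the logarithms are well defined.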

Upvotes: 2
