Calculating number of intersecting columns for each row with each other row from boolean numpy array

Question

I have written a short program to extract association rules from a database. However, now I want to compare the extracted rules and calculate the number of intersecting attributes from each rule with each other. The rule-conditions are represented in a boolean numpy array, where each row can be considered as the antecedent of one association rule. The columns of the boolean array are representing the attributes, which may occur in the antecedent of the rules. ("True" indicates, that the attribute is present in the rule). Example input array:

encoded_rules = np.array([[True,True,False,False,True],
                         [False,False,True,False,False],
                         [True,True,False,True,True],
                         [False,True,False,False,False]])

desired output array:

[[3 0 3 1]
[0 1 0 0]
[3 0 4 1]
[1 0 1 1]]

As you can see the rule at index 0 has 3 attributes and the number of intersecting attributes with the rule at index 2 (which has 4 attributes) is 3. To achive this goal I have tried multiple approaches, but I couldn't figure out how to do it in a efficient vectorized manner. My current solution is a for-loop in which i create the output array step by step, using lower and upper bounds. The lower and upper bounds are calculated by counting the number of occurrences of the attributes.:

encoded_rules = np.array([[True,True,False,False,True],
                         [False,False,True,False,False],
                         [True,True,False,True,True],
                         [False,True,False,False,False]])

rule_count = encoded_rules.shape[0]
rules, attributes = encoded_rules.nonzero()

#empty output array, which gets filled in the for-loop:
rule_mat = np.zeros((rule_count, rule_count), dtype="int")

sort_inds = attributes.argsort()
rules, attributes = rules[sort_inds], attributes[sort_inds]


unique_attributes, counts = np.unique(attributes, return_counts=True)
lower_bound = 0
upper_bound = 0

for attribute in unique_attributes:
    upper_bound += counts[attribute]
    intersecting_rules = rules[lower_bound:upper_bound]
    rule_mat[intersecting_rules[:,None], intersecting_rules] +=1
    lower_bound += counts[attribute]

print(rule_mat)

I'm working with a quite large array (the array, which contains the encoded rules has dimension of ~ 18.000 rows x 42 columns and on average each row has 5-6 columns whose values evaluates to "True") so on my system this procedure took about 4 seconds. (almost 0.1 second per iteration). Depending on the data, the array can even get larger.

What I now want to achieve is to get rid of the for-loop. Therefore I developed the vectorized code below. Unfortunately for big arrays this approach is about 25 times slower than using the for-loop. The basic idea of the vectorized version is to calculate the logical-and of each rule with each other and then take the sum over the columns:

encoded_rules = np.array([[True,True,False,False,True],
                         [False,False,True,False,False],
                         [True,True,False,True,True],
                         [False,True,False,False,False]])

rule_count = encoded_rules.shape[0]
encoded_rules_3d = np.stack([encoded_rules]*rule_count, axis=0)
rule_mat = encoded_rules_3d & encoded_rules[:,None]
rule_mat = rule_mat.sum(axis=2)

print(rule_mat)

Does somebody have any idea to speed up the vectorized calculation?

Lior Cohen · Accepted Answer

You can achieve this with simple linear algebra.

Imagine that every True is 1 and every False is 0

So multiply this integers matrix a by its transpose a.T will give you what you want.

Of course the numpy is vectorized.

import numpy as np

a = np.array([[True, True, False, False, True],
              [False, False, True, False, False],
              [True, True, False, True, True],
              [False, True, False, False, False]])

a_int = a.astype(int)
print(a_int @ a_int.T)

[[3 0 3 1]
 [0 1 0 0]
 [3 0 4 1]
 [1 0 1 1]]

Calculating number of intersecting columns for each row with each other row from boolean numpy array

Answers (1)

Related Questions