How to vectorize this operation

Question

Say I have two lists (always the same length):

l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0] 
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]

I have the following rules for intersections and unions I need to apply when comparing these lists element-wise:

# union and intersect
uni = [0]*len(l0)
intersec = [0]*len(l0)
for i in range(len(l0)):
    if l0[i] == l1[i]:
        uni[i] = l0[i]
        intersec[i] = l0[i]
    else:
        intersec[i] = 0  
        if l0[i] == 0:
            uni[i] = l1[i]
        elif l1[i] == 0:
            uni[i] = l0[i]
        else:
            uni[i] = [l0[i], l1[i]]

Thus, the desired output is:

uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8] 
intersec: [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]

While this works, I need to do this for several hundred very large lists (each with thousands of elements), so I am looking for a way to vectorize it. I tried np.where and various masking strategies, but that went nowhere fast. Any suggestions would be most welcome.

* EDIT *

Regarding

uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]

versus

uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]

I'm still fighting the 8 versus [0, 8] in my mind. The lists are derived from BIO tags in system annotations (see IOB labeling of text chunks), where each list position corresponds to a character index in a document and the value is an assigned enumerated label. 0 means no annotation (i.e., it is used for determining negatives in a confusion matrix), while nonzero values are the enumerated labels assigned to that character. Since I am ignoring true negatives, I think I can say 8 is equivalent to [0, 8]. As to whether this simplifies things, I am not yet sure.

* EDIT 2 *

I'm using [0, 8] to keep things simple and to keep the definitions of intersection and union consistent with set theory.
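
For illustration, here is a minimal sketch of what the [0, 8] convention might look like, assuming it means that any two differing labels (including a 0 paired with a nonzero label) are kept as a pair. The comprehension form is only one way to write it; the names uni and intersec are taken from the loop above.

```python
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]

# Assumed rule: equal labels collapse to one value,
# differing labels (even when one is 0) are kept as a pair.
uni = [x if x == y else [x, y] for x, y in zip(l0, l1)]

# The intersection rule is unchanged: the shared label, else 0.
intersec = [x if x == y else 0 for x, y in zip(l0, l1)]

print(uni)       # [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
print(intersec)  # [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
```

Under this assumption the zero checks disappear from the union rule, which is the simplification hinted at in the first edit.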

Grismar · Accepted Answer

I would stay away from calling them 'intersection' and 'union', since those operations have well-defined meanings on sets and the operation you're looking to perform is neither of them.
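
To make the distinction concrete, here is a small sketch of what the set operations return for the same data. The variable names set_intersection and set_union are only illustrative.

```python
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]

# Set operations ignore position and multiplicity entirely.
set_intersection = set(l0) & set(l1)
set_union = set(l0) | set(l1)

print(sorted(set_intersection))  # [0, 8]
print(sorted(set_union))         # [0, 1, 4, 8]
```

Neither result has one entry per position, which is what the element-wise rules in the question produce.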

However, to do what you want:

l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]

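values = [
    (
        # "intersection": the shared label, or 0 when the labels differ
        x if x == y else 0,
        # "union": the single label when both agree or one side is 0,
        # otherwise both labels as a pair
        x if x == y or y == 0 else y if x == 0 else [x, y],
    )
    for x, y in zip(l0, l1)
]

# Transpose the list of pairs into two separate lists.
result_a, result_b = map(list, zip(*values))

print(result_a)  # [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
print(result_b)  # [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]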

This is more than fast enough for thousands, or even millions, of elements, since the per-element work is so basic. Of course, if we're talking billions, you may want to look at NumPy anyway.
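
If NumPy does turn out to be needed, one possible approach is sketched below. It is not a drop-in equivalent: a NumPy integer array cannot hold a mix of scalars and pairs, so this sketch assumes it is acceptable to keep the conflicts as a boolean mask and only build the mixed list at the end. The names a, b, equal, conflict and uni_scalar are illustrative.

```python
import numpy as np

l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]

a = np.asarray(l0)
b = np.asarray(l1)

equal = a == b

# "Intersection": the shared label where both agree, else 0.
intersec = np.where(equal, a, 0)

# Positions where the labels differ and neither is 0 need a pair.
conflict = ~equal & (a != 0) & (b != 0)

# Everywhere else a single label is enough: take a unless it is 0.
# At conflict positions this holds a placeholder that is replaced below.
uni_scalar = np.where(a != 0, a, b)

# Build the mixed list only if it is really needed; this last step
# is a plain Python loop over the conflict positions.
uni = uni_scalar.tolist()
for i in np.flatnonzero(conflict):
    uni[i] = [a[i].item(), b[i].item()]

print(intersec.tolist())  # [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
print(uni)                # [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
```

For downstream counting, such as a confusion matrix, working directly with a, b and the conflict mask will likely be faster than materializing the nested list.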
