Reputation: 2161
Say I have two lists (always the same length):
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
I have the following rules for intersections and unions I need to apply when comparing these lists element-wise:
# union and intersect
uni = [0]*len(l0)
intersec = [0]*len(l0)
for i in range(len(l0)):
    if l0[i] == l1[i]:
        uni[i] = l0[i]
        intersec[i] = l0[i]
    else:
        intersec[i] = 0
        if l0[i] == 0:
            uni[i] = l1[i]
        elif l1[i] == 0:
            uni[i] = l0[i]
        else:
            uni[i] = [l0[i], l1[i]]
Thus, the desired output is:
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
intersec: [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
While this works, I need to do this with several hundred very large lists (each with thousands of elements), so I am looking for a way to vectorize this. I tried using np.where and various masking strategies, but that went nowhere fast. Any suggestions would be most welcome.
* EDIT *
Regarding
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, 8]
versus
uni: [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
I'm still fighting the 8 versus [0, 8] in my mind. The lists are derived from BIO tags in system annotations (see IOB labeling of text chunks), where each list element is a character index in a document and the value is an assigned enumerated label. 0 is the label for no annotation (i.e., it is used for determining negatives in a confusion matrix), while non-zero elements represent assigned enumerated labels for that character. Since I am ignoring true negatives, I think I can say 8 is equivalent to [0, 8]. As to whether this simplifies things, I am not yet sure.
* EDIT 2 *
I'm using [0, 8] to keep things simple and to keep the definitions of intersection and union consistent with set theory.
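Under the EDIT 2 convention, the element-wise rule seems to collapse to two cases: equal values are kept as-is, and any mismatch (including one side being 0) becomes a pair in the union and 0 in the intersection. A minimal sketch of that reading, where the helper name `compare_elementwise` is made up for illustration and not part of any library:

```python
def compare_elementwise(l0, l1):
    """Return (uni, intersec) under the EDIT 2 convention (hypothetical helper)."""
    uni, intersec = [], []
    for x, y in zip(l0, l1):
        if x == y:
            # Both lists agree: the value appears in union and intersection.
            uni.append(x)
            intersec.append(x)
        else:
            # Any disagreement, including 0 versus a label, becomes a pair.
            uni.append([x, y])
            intersec.append(0)
    return uni, intersec


l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
uni, intersec = compare_elementwise(l0, l1)
print(uni)       # [0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
print(intersec)  # [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
```

This is still a plain Python loop, so it only pins down the intended semantics; it is not a vectorized solution.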
Upvotes: 5
Views: 215
Reputation: 31319
I would stay away from calling them 'intersection' and 'union', since those operations have well-defined meanings on sets and the operation you're looking to perform is neither of them.
However, to do what you want:
l0 = [0, 4, 4, 4, 0, 0, 0, 8, 8, 0]
l1 = [0, 1, 1, 1, 0, 0, 0, 8, 8, 8]
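values = [
    (
        # first element: keep the value only where both lists agree
        x if x == y else 0,
        # second element: the shared value, the non-zero one, or the pair
        x if x == y
        else x if y == 0
        else y if x == 0
        else [x, y],
    )
    for x, y in zip(l0, l1)
]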
result_a, result_b = map(list, zip(*values))
print(result_a)
print(result_b)
This is more than enough for thousands, or even millions of elements since the operation is so basic. Of course, if we're talking billions, you may want to look at numpy anyway.
Upvotes: 2
Reputation: 1329
A semi-vectorized solution for the union and a fully vectorized one for the intersection:
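import numpy as np
l0 = np.array(l0)
l1 = np.array(l1)
# intersection: keep the value where both lists agree, 0 elsewhere
intersec = np.zeros(l0.shape[0])
intersec_idx = np.where(l0 == l1)
intersec[intersec_idx] = l0[intersec_idx]
intersec = intersec.astype(int).tolist()
# union: start from the agreeing values, then fill in pairs where they differ
union = np.zeros(l0.shape[0])
union_idx = np.where(l0 == l1)
union[union_idx] = l0[union_idx]
no_union_idx = np.where(l0 != l1)
union = union.astype(int).tolist()
for idx in no_union_idx[0]:
    # int() converts NumPy scalars so the pairs print as plain Python ints
    union[idx] = [int(l0[idx]), int(l1[idx])]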
and the output:
>>> intersec
[0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
>>> union
[0, [4, 1], [4, 1], [4, 1], 0, 0, 0, 8, 8, [0, 8]]
NB: I think your original union solution is incorrect: compare the last element, 8 versus [0, 8].
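If the mixed scalar/list output is not strictly required, the remaining Python loop can probably be avoided by keeping the union as a fixed-width array of pairs. The sketch below is one possible approach, not the method used above, and the names `same`, `union_pairs` and `mismatch_idx` are illustrative only. It assumes the [0, 8] convention, so every position where the lists differ counts as a pair.

```python
import numpy as np

l0 = np.array([0, 4, 4, 4, 0, 0, 0, 8, 8, 0])
l1 = np.array([0, 1, 1, 1, 0, 0, 0, 8, 8, 8])

# Boolean mask of positions where both lists carry the same label.
same = l0 == l1

# Intersection: the shared label where they agree, 0 elsewhere.
intersec = np.where(same, l0, 0)

# Union as an (n, 2) array: column 0 holds the l0 label, column 1 the l1 label.
# Rows with equal columns correspond to the scalar entries of the list version.
union_pairs = np.stack([l0, l1], axis=1)

# Positions where the union is a genuine pair of differing labels.
mismatch_idx = np.flatnonzero(~same)

print(intersec.tolist())                   # [0, 0, 0, 0, 0, 0, 0, 8, 8, 0]
print(mismatch_idx.tolist())               # [1, 2, 3, 9]
print(union_pairs[mismatch_idx].tolist())  # [[4, 1], [4, 1], [4, 1], [0, 8]]
```

A ragged list that mixes scalars and pairs cannot be stored in a regular NumPy array, which is why this variant keeps both columns and a separate index of mismatches instead.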
Upvotes: 0