Reputation: 1369
I have data in the form of 0010011 or [False, False, True, False, False, True, True]. There are roughly 10 million entries in this format, except that each one has not seven but one thousand values.
In my use case, I'm getting a new entry in the same form. Then, I want to get the indexes of the 100 most similar entries. Here, most similar is defined as having the most intersecting 1s or Trues, respectively. E.g., the two entries 00011 and 00010 have one intersecting 1.
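For illustration, this metric on the example above can be computed with a bitwise AND on integers (a minimal sketch, not part of the original question):

a, b = 0b00011, 0b00010
print(bin(a & b).count('1'))  # 1 intersecting 1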
Currently, I'm doing the comparison like this:
import numpy as np

similarities = []
for idx, binary_1 in enumerate(complete_list):
    # count the intersecting 1s between this entry and the new one
    similarities.append((idx, np.binary_repr(binary_1 & binary_2).count('1')))
similarities.sort(key=lambda t: t[1], reverse=True)
For 10000 random test entries, this takes 0.2 seconds. Is there any faster way to do this?
Upvotes: 2
Views: 1471
Reputation: 53089
Update: found a threefold speedup.
Here is an approach that uses numpy on packed bits to save memory. To convert between this format and individual 0s and 1s of type uint8, numpy provides the functions packbits and unpackbits.
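For illustration, a minimal sketch of that round trip on the seven-bit example from the question (packbits pads to whole bytes):

import numpy as np

bits = np.array([0, 0, 1, 0, 0, 1, 1], dtype=np.uint8)
packed = np.packbits(bits)        # array([38], dtype=uint8): 0b00100110, padded with a trailing 0
restored = np.unpackbits(packed)  # 8 bits: the original seven plus one bit of padding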
The code below precomputes the bit counts for all 2^16 patterns that can be formed by chunks of 16 bits. (The old version did the lookup for pairs of bytes in the data and the template.)
We use view casting to uint64 to perform the bitwise intersection on chunks of 64 bits and then cast back to uint16 for the lookup.
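As a small illustration of the view casting (on a hypothetical 16-byte row, not taken from the code below):

import numpy as np

row = np.arange(16, dtype=np.uint8)  # 16 packed bytes = 128 bits
chunks64 = row.view(np.uint64)       # 2 chunks of 64 bits for the bitwise AND
chunks16 = chunks64.view(np.uint16)  # 8 chunks of 16 bits for the table lookup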
To find the n closest we use argpartition (O(N)) instead of argsort (O(N log N)).
import numpy as np

n, m = 1_000_000, 1_000
# packed representation: each row stores m bits in (m + 63) // 64 * 8 bytes
data = np.random.randint(0, 256, (n, (m + 63) // 64 * 8), dtype=np.uint8)
test = np.random.randint(0, 256, ((m + 63) // 64 * 8,), dtype=np.uint8)

def create_lookup_1d():
    # bit counts for all 2^16 patterns of 16 bits
    x, p = np.ogrid[:1 << 16, :16]
    p = 1 << p
    return np.count_nonzero(x & p, axis=1)

lookup_1d = create_lookup_1d()

def find_closest(data, test, n):
    # intersect in 64-bit chunks, then look up bit counts in 16-bit chunks
    similarities = lookup_1d[(data.view(np.uint64) & test.view(np.uint64))
                             .view(np.uint16)].sum(axis=1)
    # O(N) selection of the n best; the result is not sorted
    top_n = np.argpartition(similarities, len(data) - n)[-n:]
    return top_n, similarities[top_n]
# below is the obsolete older version

def create_lookup_2d():
    # intersection bit counts for all pairs of bytes
    x, y, p = np.ogrid[:256, :256, :8]
    p = 1 << p
    return np.count_nonzero(x & y & p, axis=2)

lookup_2d = create_lookup_2d()

def find_closest_old(data, test, n):
    similarities = lookup_2d[data, test].sum(axis=1)
    top_n = np.argpartition(similarities, len(data) - n)[-n:]
    return top_n, similarities[top_n]
Demo (one million entries, one thousand bits each, find the hundred best):
>>> import time
>>> t = time.perf_counter(); find_closest(data, test, 100); t = time.perf_counter() - t
(array([913659, 727762, 819589, 810536, 671392, 573459, 197431, 642848,
8792, 710169, 656667, 692412, 23355, 695527, 276548, 756096,
286481, 931702, 301590, 309232, 223684, 838507, 888607, 205403,
988198, 600239, 256834, 876452, 793813, 46501, 559521, 697295,
948215, 247923, 503962, 808630, 515953, 22821, 614888, 487735,
443673, 174083, 906902, 613131, 546603, 147657, 332898, 381553,
808760, 383885, 107547, 85942, 20966, 880034, 522925, 18833,
547674, 901503, 702596, 773050, 734658, 383581, 973043, 387713,
645705, 27045, 230226, 77403, 906601, 507193, 828268, 175863,
708155, 130634, 486701, 534719, 643487, 940071, 694781, 470385,
954446, 134532, 748100, 110987, 417001, 871320, 993915, 489725,
6509, 38345, 705618, 637435, 311252, 347282, 536091, 663643,
830238, 376695, 896090, 823631]), array([305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305,
305, 306, 305, 306, 306, 305, 305, 305, 305, 305, 306, 305, 306,
305, 306, 314, 308, 307, 309, 306, 308, 307, 308, 307, 312, 308,
306, 316, 306, 306, 307, 307, 308, 309, 308, 307, 309, 309, 311,
309, 310, 310, 307, 307, 306, 307, 306, 307, 309, 308, 309, 308,
306, 307, 309, 306, 306, 306, 311, 306, 308, 307, 306, 306, 307,
308, 306, 307, 310, 307, 306, 307, 309, 306, 306, 310, 313, 306,
306, 307, 310, 306, 307, 307, 309, 311, 307]))
>>> t
0.4612512579988106
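Since argpartition only partitions, the hundred similarities above come back in no particular order; if a ranked result is needed, the small top slice can be sorted cheaply afterwards. A minimal sketch, reusing find_closest from above:

top_n, sims = find_closest(data, test, 100)
order = np.argsort(sims)[::-1]  # sort only the 100 winners, best first
top_n, sims = top_n[order], sims[order]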
Upvotes: 1
Reputation: 9887
Using broadcasting might help here. For example,
import numpy as np

complete_list = np.random.randint(0, 2, (10000, 10)).astype(bool)
binary_2 = np.random.randint(0, 2, 10).astype(bool)

# broadcast the AND over all rows, then count intersecting Trues per row
similarities = np.sum(complete_list & binary_2, axis=1)
idx = np.argsort(similarities)

print("Seed", binary_2)
print("Result", complete_list[idx[-1]])
print("Similarity", similarities[idx[-1]])
I couldn't get your example to run (maybe different python/library versions?), so I haven't run any benchmarks comparing the two approaches. Of course, our machines will differ, but the above takes about half a millisecond on mine.
Note that I've used & rather than | given your description of the intended logic.
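To get the 100 most similar entries the question asks for, the tail of the argsort result could be sliced; a minimal sketch building on the variables above:

top_100 = idx[-100:][::-1]  # indices of the 100 most similar rows, best match first
print(top_100[:5], similarities[top_100[:5]])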
Upvotes: 0