Reputation: 3437
I am working with a dataset that includes tens of thousands of bitstrings of variable length up to 32768.
The data originates in a list, which I then concatenate into multiple binary bitstrings, which in turn are converted into an INT.
To demonstrate my method:
import array
a = array.array('B', [4,5,13,4,4,9,12,13])
d = dict()
for i in set(a):
d[i] = int('0b' + ''.join([str(int(i is b)) for b in a]), 2)
print d
My question is: is there a more efficient way of accomplishing this without constructing a ~32000 length string and then adding '0b' to cast it to an INT? (I assume my method uses up at least 32000 bytes in addition to the source list for the operation)
In addition, is INT the cheapest way to store these, assuming I wish to store them in a form that permits bit-wise operations?
Upvotes: 1
Views: 2788
Reputation: 151157
If you need maximal efficiency, one package that should get you pretty close is bitarray
. It's written in c, so it's lightning fast.
To create a bitarray
, you can pass the bitarray
constructor any sequence of boolean-like values. For example:
>>> bitarray.bitarray([1, 1, 0, 1])
bitarray('1101')
>>> bitarray.bitarray([True, True, False, True])
bitarray('1101')
>>> bitarray.bitarray([range(3), [5.829048], [], ['can', 'I', 'help', 'you?']])
bitarray('1101')
I did some timings, and indeed bitarray
is fastest for longer strings. But there were a few surprises:
bitarray
is only 50% faster than int(''.join(bs))
in most cases.int(''.join(bs))
is faster than the shift
method proposed by tomasz for lengths of a
greater than 3000, and way faster for lengths of a
greater than 30000.len(a)
, using the shift
method is only a few times faster. The asymptotic performance boost that ''.join()
provides for longer strings is on the order of seconds or tens of seconds, while the shift
method only provides gains on the order of milliseconds for small strings. So using ''.join()
is the clear winner here.So if you don't want to use an outside library, then using ''.join()
as you do above is the best solution! And indeed, the benefit of using an outside library in this case is minimal; so in the end, I wouldn't recommend it after all -- unless you mainly want to save memory.
Finally, one minor note: You don't have to append '0b'
to the string as you do above. Just call int(bitstring, 2)
-- the base argument (2
) makes the '0b'
redundant.
>>> import array
>>> import random
>>> import bitarray
>>>
>>> #### Definitions: ####
>>>
>>> def a_mask_join(a):
..... d = dict()
..... for i in set(a):
..... d[i] = int(''.join([str(int(i is b)) for b in a]), 2)
..... return d
.....
>>> def mask(values, x):
..... m = 0
..... for v in values:
..... m = (m << 1) + (v == x)
..... return m
.....
>>> def a_mask_shift(a):
..... d = dict()
..... for i in set(a):
..... d[i] = mask(a, i)
..... return d
.....
>>> def a_mask_bitarray1(a):
..... d = dict()
..... for i in set(a):
..... d[i] = bitarray.bitarray([int(i is b) for b in a])
..... return d
.....
>>> def a_mask_bitarray2(a):
..... d = dict()
..... for i in set(a):
..... d[i] = int(bitarray.bitarray([int(i is b) for b in a]).to01(), 2)
..... return d
.....
>>> a = array.array('B', [4,5,13,4,4,9,12,13])
>>>
>>> #### Test: ####
>>>
>>> dicts = (f(a) for f in (a_mask_join, a_mask_shift1, a_mask_shift2, a_mask_bitarray2))
>>> sorted_results = (sorted(int(v) for v in d.values()) for d in dicts)
>>> all(r == sorted(a_mask1(a).values()) for r in sorted_results)
True
>>>
>>> #### Timing: ####
>>>
>>> for size in (int(10 ** (e / 2.0)) for e in range(2, 11)):
..... print size
..... a = array.array('B', [random.randrange(0, 30) for _ in range(size)])
..... %timeit a_mask_join(a)
..... %timeit a_mask_shift(a)
..... %timeit a_mask_bitarray1(a)
..... %timeit a_mask_bitarray2(a)
.....
10
10000 loops, best of 3: 61.2 us per loop
100000 loops, best of 3: 17.5 us per loop
10000 loops, best of 3: 38.4 us per loop
10000 loops, best of 3: 46.7 us per loop
31
1000 loops, best of 3: 343 us per loop
10000 loops, best of 3: 97.9 us per loop
1000 loops, best of 3: 212 us per loop
1000 loops, best of 3: 242 us per loop
100
1000 loops, best of 3: 1.45 ms per loop
1000 loops, best of 3: 486 us per loop
1000 loops, best of 3: 825 us per loop
1000 loops, best of 3: 870 us per loop
316
100 loops, best of 3: 4.53 ms per loop
100 loops, best of 3: 2.46 ms per loop
100 loops, best of 3: 2.53 ms per loop
100 loops, best of 3: 2.65 ms per loop
1000
100 loops, best of 3: 14.5 ms per loop
100 loops, best of 3: 10.8 ms per loop
100 loops, best of 3: 7.78 ms per loop
100 loops, best of 3: 8.04 ms per loop
3162
10 loops, best of 3: 47.4 ms per loop
10 loops, best of 3: 71.8 ms per loop
10 loops, best of 3: 24.1 ms per loop
10 loops, best of 3: 25.6 ms per loop
10000
10 loops, best of 3: 137 ms per loop
1 loops, best of 3: 425 ms per loop
10 loops, best of 3: 75.7 ms per loop
10 loops, best of 3: 78 ms per loop
31622
1 loops, best of 3: 430 ms per loop
1 loops, best of 3: 3.25 s per loop
1 loops, best of 3: 241 ms per loop
1 loops, best of 3: 246 ms per loop
100000
1 loops, best of 3: 1.37 s per loop
1 loops, best of 3: 29.7 s per loop
1 loops, best of 3: 805 ms per loop
1 loops, best of 3: 800 ms per loop
Upvotes: 6
Reputation: 13072
Not sure if I get your intention right, but my understanding is you want to create an integer which is masking where the value exists in an array. If so, you don't need to create string at all, simply do it manually:
def mask(values, x):
m = 0
for v in values:
m = (m << 1) + (v == x) # shift left and flip the lowest bit if needed
return m
And in action:
for i in set(a):
d[i] = mask(a, i)
It's much faster than using join and doesn't use additional memory but the desired integer. I bet you can use a few methods from itertools
, pipe them together and impress some people around, but I would just stick with a method. And I think you're right, can't imagine better way of storing it than an integer.
Upvotes: 1
Reputation: 96984
Take a look at the bitstring library. You can manipulate individual bits of a BitArray
(the following example shows immutable Bits
instances):
from bitstring import Bits
def bitLen(int_type):
length = 0
while (int_type):
int_type >>= 1
length += 1
return(length)
intList = [4,5,13,4,4,9,12,13]
intSet = set(intList)
for i in intSet:
print i, Bits(int=i, length=bitLen(i)+1).bin
Which provides:
9 0b01001
4 0b0100
5 0b0101
12 0b01100
13 0b01101
You can use Bits.int
to convert a bit representation to int
, and Bits.len
to get the length. I'll leave it to you to work this into the dictionary form you expect.
Upvotes: 3