hexparrot
hexparrot

Reputation: 3437

What is the most efficient way to concatenate and store long bit-strings?

I am working with a dataset that includes tens of thousands of bitstrings of variable length up to 32768.

The data originates in a list, which I then concatenate into multiple binary bitstrings, which in turn are converted into an INT.

To demonstrate my method:

import array

a = array.array('B', [4,5,13,4,4,9,12,13])
d = dict()

for i in set(a):
    d[i] = int('0b' + ''.join([str(int(i is b)) for b in a]), 2)

print d

My question is: is there a more efficient way of accomplishing this without constructing a ~32000 length string and then adding '0b' to cast it to an INT? (I assume my method uses up at least 32000 bytes in addition to the source list for the operation)

In addition, is INT the cheapest way to store these, assuming I wish to store them in a form that permits bit-wise operations?

Upvotes: 1

Views: 2788

Answers (3)

senderle
senderle

Reputation: 151157

If you need maximal efficiency, one package that should get you pretty close is bitarray. It's written in c, so it's lightning fast.

To create a bitarray, you can pass the bitarray constructor any sequence of boolean-like values. For example:

>>> bitarray.bitarray([1, 1, 0, 1])
bitarray('1101')
>>> bitarray.bitarray([True, True, False, True])
bitarray('1101')
>>> bitarray.bitarray([range(3), [5.829048], [], ['can', 'I', 'help', 'you?']])
bitarray('1101')

I did some timings, and indeed bitarray is fastest for longer strings. But there were a few surprises:

  1. bitarray is only 50% faster than int(''.join(bs)) in most cases.
  2. int(''.join(bs)) is faster than the shift method proposed by tomasz for lengths of a greater than 3000, and way faster for lengths of a greater than 30000.
  3. Even for small len(a), using the shift method is only a few times faster. The asymptotic performance boost that ''.join() provides for longer strings is on the order of seconds or tens of seconds, while the shift method only provides gains on the order of milliseconds for small strings. So using ''.join() is the clear winner here.

So if you don't want to use an outside library, then using ''.join() as you do above is the best solution! And indeed, the benefit of using an outside library in this case is minimal; so in the end, I wouldn't recommend it after all -- unless you mainly want to save memory.

Finally, one minor note: You don't have to append '0b' to the string as you do above. Just call int(bitstring, 2) -- the base argument (2) makes the '0b' redundant.

>>> import array
>>> import random
>>> import bitarray
>>> 
>>>        #### Definitions: ####
>>> 
>>> def a_mask_join(a):
.....     d = dict()
.....     for i in set(a):
.....         d[i] = int(''.join([str(int(i is b)) for b in a]), 2)
.....     return d
..... 
>>> def mask(values, x):
.....     m = 0
.....     for v in values:
.....         m = (m << 1) + (v == x)
.....     return m
..... 
>>> def a_mask_shift(a):
.....     d = dict()
.....     for i in set(a):
.....         d[i] = mask(a, i)
.....     return d
..... 
>>> def a_mask_bitarray1(a):
.....     d = dict()
.....     for i in set(a):
.....         d[i] = bitarray.bitarray([int(i is b) for b in a])
.....     return d
..... 
>>> def a_mask_bitarray2(a):
.....     d = dict()
.....     for i in set(a):
.....         d[i] = int(bitarray.bitarray([int(i is b) for b in a]).to01(), 2)
.....     return d
..... 
>>> a = array.array('B', [4,5,13,4,4,9,12,13])
>>> 
>>>        #### Test: ####
>>> 
>>> dicts = (f(a) for f in (a_mask_join, a_mask_shift1, a_mask_shift2, a_mask_bitarray2))
>>> sorted_results = (sorted(int(v) for v in d.values()) for d in dicts)
>>> all(r == sorted(a_mask1(a).values()) for r in sorted_results)
True
>>> 
>>>        #### Timing: ####
>>> 
>>> for size in (int(10 ** (e / 2.0)) for e in range(2, 11)):
.....     print size
.....     a = array.array('B', [random.randrange(0, 30) for _ in range(size)])
.....     %timeit a_mask_join(a)
.....     %timeit a_mask_shift(a)
.....     %timeit a_mask_bitarray1(a)
.....     %timeit a_mask_bitarray2(a)
..... 
10
10000 loops, best of 3: 61.2 us per loop
100000 loops, best of 3: 17.5 us per loop
10000 loops, best of 3: 38.4 us per loop
10000 loops, best of 3: 46.7 us per loop
31
1000 loops, best of 3: 343 us per loop
10000 loops, best of 3: 97.9 us per loop
1000 loops, best of 3: 212 us per loop
1000 loops, best of 3: 242 us per loop
100
1000 loops, best of 3: 1.45 ms per loop
1000 loops, best of 3: 486 us per loop
1000 loops, best of 3: 825 us per loop
1000 loops, best of 3: 870 us per loop
316
100 loops, best of 3: 4.53 ms per loop
100 loops, best of 3: 2.46 ms per loop
100 loops, best of 3: 2.53 ms per loop
100 loops, best of 3: 2.65 ms per loop
1000
100 loops, best of 3: 14.5 ms per loop
100 loops, best of 3: 10.8 ms per loop
100 loops, best of 3: 7.78 ms per loop
100 loops, best of 3: 8.04 ms per loop
3162
10 loops, best of 3: 47.4 ms per loop
10 loops, best of 3: 71.8 ms per loop
10 loops, best of 3: 24.1 ms per loop
10 loops, best of 3: 25.6 ms per loop
10000
10 loops, best of 3: 137 ms per loop
1 loops, best of 3: 425 ms per loop
10 loops, best of 3: 75.7 ms per loop
10 loops, best of 3: 78 ms per loop
31622
1 loops, best of 3: 430 ms per loop
1 loops, best of 3: 3.25 s per loop
1 loops, best of 3: 241 ms per loop
1 loops, best of 3: 246 ms per loop
100000
1 loops, best of 3: 1.37 s per loop
1 loops, best of 3: 29.7 s per loop
1 loops, best of 3: 805 ms per loop
1 loops, best of 3: 800 ms per loop

Upvotes: 6

tomasz
tomasz

Reputation: 13072

Not sure if I get your intention right, but my understanding is you want to create an integer which is masking where the value exists in an array. If so, you don't need to create string at all, simply do it manually:

def mask(values, x):
  m = 0
  for v in values:
     m = (m << 1) + (v == x) # shift left and flip the lowest bit if needed
  return m

And in action:

for i in set(a):
  d[i] = mask(a, i)

It's much faster than using join and doesn't use additional memory but the desired integer. I bet you can use a few methods from itertools, pipe them together and impress some people around, but I would just stick with a method. And I think you're right, can't imagine better way of storing it than an integer.

Upvotes: 1

Alex Reynolds
Alex Reynolds

Reputation: 96984

Take a look at the bitstring library. You can manipulate individual bits of a BitArray (the following example shows immutable Bits instances):

from bitstring import Bits

def bitLen(int_type):
    length = 0
    while (int_type):
        int_type >>= 1
        length += 1
    return(length)

intList = [4,5,13,4,4,9,12,13]
intSet = set(intList)
for i in intSet:
   print i, Bits(int=i, length=bitLen(i)+1).bin

Which provides:

9 0b01001
4 0b0100
5 0b0101
12 0b01100
13 0b01101

You can use Bits.int to convert a bit representation to int, and Bits.len to get the length. I'll leave it to you to work this into the dictionary form you expect.

Upvotes: 3

Related Questions