Reputation: 79
I have billions of lines of strings like this in memory: 1010101110100111100100101. I need to convert each one into a list of binary integers. It takes minutes, which seems too slow. My code:
def string2vec(binary_str):
    return [int(binary_str[i:i + 8], base=2) for i in range(0, 64, 8)]

result = [string2vec(l) for l in lines]  # this code is slow
binary_str has length 64, and every 8 binary characters become one integer.
Upvotes: 3
Views: 872
Reputation: 4283
EDIT: It appears this functionality may be built into python; see comments. I'll leave this answer because it provides a minimal working example of a C library for Python that manipulates arrays, which I was not able to find elsewhere online.
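A minimal sketch of the built-in approach the comments point to, using `int(s, 2)` to parse the whole string and `int.to_bytes` to split it into bytes (the function name `string2vec` is reused from the question):

```python
# Parse the whole 64-char string as one integer, then split it into
# 8 big-endian bytes; each byte covers 8 of the original binary digits.
def string2vec(binary_str):
    return list(int(binary_str, 2).to_bytes(8, 'big'))

print(string2vec('1' * 56 + '0' * 8))
# [255, 255, 255, 255, 255, 255, 255, 0]
```

This avoids the per-slice `int()` calls of the original list comprehension.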
I agree with many of the comments that something has clearly gone wrong if you have a bunch of binary strings in human-readable format sitting around in memory. However, if there are reasons outside your control that this can't be avoided, you could try writing the relevant functionality in C. Here's a straightforward example to start from:
#include <Python.h>
static PyObject * binary_string(PyObject * self, PyObject * args);
static PyMethodDef PyBinaryString_methods[] =
{
{ "binary_string", binary_string, METH_VARARGS, "binary string" },
{ NULL, NULL, 0, NULL }
};
static struct PyModuleDef PyBinaryString_module =
{
PyModuleDef_HEAD_INIT,
"PyBinaryString",
"Binary String",
-1,
PyBinaryString_methods
};
PyMODINIT_FUNC PyInit_PyBinaryString(void)
{
return PyModule_Create(&PyBinaryString_module);
}
static PyObject * binary_string(PyObject * self, PyObject * args)
{
const char * string;
char buf[8];
if(!PyArg_ParseTuple(args, "s", &string)) { return NULL; }
for(int i = 0; i < 8; i++)
{
buf[i] = 0;
for(int j = 0; j < 8; j++)
{
buf[i] |= (string[8 * i + j] & 1) << (7 - j);
}
}
return PyByteArray_FromStringAndSize(buf, 8);
}
Here I'm exploiting the fact that the string is going to consist of ASCII '0' and '1' characters exclusively, and that the ASCII code for the former is even whereas the ASCII code for the latter is odd.
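The same parity trick can be checked from Python: the low bit of each character's code point is the digit itself, so `code & 1` recovers the digit without any comparison.

```python
# ASCII '0' is 0x30 (even) and '1' is 0x31 (odd), so ord(ch) & 1 is the digit.
for ch in '01':
    print(ch, ord(ch), ord(ch) & 1)

# Pure-Python equivalent of the C inner loop, building one byte:
s = '10110100'
byte = 0
for j, ch in enumerate(s):
    byte |= (ord(ch) & 1) << (7 - j)
print(byte)  # 180, same as int(s, 2)
```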
On my system I can compile this via
cc -fPIC -shared -O3 -I/usr/include/python -o PyBinaryString.so PyBinaryString.c
and then use it in Python like so:
>>> from PyBinaryString import binary_string
>>> binary_string("1111111111111111111111111111111111111111111111111111111100000000")
bytearray(b'\xff\xff\xff\xff\xff\xff\xff\x00')
I'm not a Python programmer, so someone might be able to provide a better way of getting data into and out of the Python object formats. However, on my machine this runs about an order of magnitude faster than the native Python version.
If you know more about the layout in memory -- say if you know that all the strings of ASCII '0' and '1' characters are contiguous -- you could have the C code convert everything at once, which would probably speed things up further.
Upvotes: 4
Reputation: 178001
binary_str has length 64, and every 8 binary characters become one integer.
All that string slicing and Python looping is expensive. Use int(s,2)
to convert the whole binary string to integer. Then use array
to manage the integers as 64-bit integers and convert to 8-bit integers. You can decide if you want big- or little-endian results for the bytes:
import random
import time
import array
ints = [random.randrange(1<<64) for _ in range(1000)] # Make 1000 integers
strs = [f'{n:064b}' for n in ints] # Represent as binary strings
print(f'{ints[0]:016X} {strs[0]}')
start = time.perf_counter()
ints2 = [int(s,2) for s in strs] # convert all the strings to integers
a = array.array('Q', ints2)  # Store the converted ints in an array. Q = quadwords (64-bit unsigned ints)
a.byteswap() # Optional if you want the opposite endian-ness of your machine.
b = array.array('B') # Another array of bytes
b.frombytes(a.tobytes()) # Populate byte array with the bytes from the quadword array.
print(time.perf_counter() - start)
assert ints == ints2
print([hex(n) for n in b[:8]])
Output:
1E27DFA21406A338 0001111000100111110111111010001000010100000001101010001100111000
0.0005346000000372442
['0x1e', '0x27', '0xdf', '0xa2', '0x14', '0x6', '0xa3', '0x38']
My machine is little-endian (most are). This converts one thousand 64-digit binary strings into integers, stores them in an array, byte swaps them to represent big-endian, then remaps the bytes of the array to a byte array...all in 534.6 microseconds on my machine. I've displayed the first 64-character string and its hexadecimal representation, and the first 8 bytes of the final result. If you truly have "billions" of these strings, it'll take about 9 minutes per billion, but don't read them all into memory at once :)
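The steps above can be wrapped into a chunk-at-a-time helper so you never hold all the strings in memory at once. This is a sketch; the function name and the endianness guard are additions, not part of the original answer:

```python
import array
import sys

def convert_chunk(lines):
    """Convert an iterable of 64-char binary strings to one flat byte array."""
    a = array.array('Q', (int(s, 2) for s in lines))
    if sys.byteorder == 'little':
        a.byteswap()  # normalize to big-endian byte order on any machine
    b = array.array('B')
    b.frombytes(a.tobytes())
    return b

# Feed chunks of, say, a million lines at a time from your source:
demo = ['1' * 56 + '0' * 8]
print(list(convert_chunk(demo)))
# [255, 255, 255, 255, 255, 255, 255, 0]
```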
Upvotes: 4
Reputation: 4673
Since there are only 2^8 = 256 possible 8-character chunks, you could construct a lookup table (in the form of a dict) mapping each 8-character string to its corresponding integer.
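A sketch of that idea (the table and function names are illustrative, reusing `string2vec` from the question):

```python
# Precompute all 256 eight-character binary strings once;
# each conversion is then a dict lookup instead of an int() parse.
TABLE = {format(i, '08b'): i for i in range(256)}

def string2vec(binary_str):
    return [TABLE[binary_str[i:i + 8]] for i in range(0, 64, 8)]

print(string2vec('1' * 56 + '0' * 8))
# [255, 255, 255, 255, 255, 255, 255, 0]
```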
Upvotes: 0