Reputation: 1960
Is there a native numpy way to convert an array of string representations of booleans, e.g.:
['True','False','True','False']
to an actual boolean array I can use for masking/indexing? I could loop through and rebuild the array, but for large arrays this is slow.
Upvotes: 8
Views: 9974
Reputation: 21089
I've found a method that's even faster than DSM's, taking inspiration from Eric, though the improvement is best seen with smaller lists of values; for very large lists, the cost of the iteration itself starts to outweigh the advantage of performing the truth testing during creation of the numpy array rather than after. I tested with both the is operator and the == operator, to cover the case where the strings are interned and the case where they might not be (is would not work with non-interned strings, though since 'True' is probably going to be a literal in the script, it should be interned). The version with == was slower than the one with is, but it was still much faster than DSM's version.
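To illustrate the interning caveat, here is a minimal sketch (not part of the benchmark). The names target and built are hypothetical, and whether a runtime-built string shares identity with a literal is a CPython implementation detail, so the output of the is version is not guaranteed:

```python
import numpy as np

target = "True"                  # literal in the source; typically interned by CPython
built = "".join(["Tr", "ue"])    # same value, but constructed at runtime

x = [target, "False", built, "False"]

# Equality compares values, so it works whether or not the strings are interned.
mask_eq = np.fromiter((e == target for e in x), dtype=bool, count=len(x))

# Identity compares objects; the result for `built` depends on the interpreter
# and should not be relied on.
mask_is = np.fromiter((e is target for e in x), dtype=bool, count=len(x))

print(mask_eq)
print(mask_is)
```

If the strings come from a file or from user input rather than from literals, the == version is the safer choice.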
Test setup:
import timeit
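def timer(statement, count):
    # The setup builds a list x of `count` random 'True'/'False' strings; only `statement` is timed.
    setup = ("from random import choice;import numpy as np;"
             "x = [choice(['True', 'False']) for i in range(%i)]" % count)
    return timeit.repeat(statement, setup)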
>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"
With 1000 items, the faster statements take roughly 65% to 72% of the time of DSM's:
>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]
For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:
>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]
A bit over 25% of DSM's when done with 50 items per list:
>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]
For 5 items, about 11% of DSM's:
>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]
Upvotes: 2
Reputation: 353549
You should be able to do a boolean comparison, IIUC, whether the dtype is a string or object:
>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'],
      dtype='|S5')
>>> a == "True"
array([ True, False, True, False], dtype=bool)
or
>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False, True, False], dtype=bool)
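Since the question is about masking, here is a short sketch of using the comparison result as a mask. The values array is hypothetical data, added only for illustration:

```python
import numpy as np

flags = np.array(['True', 'False', 'True', 'False'])
values = np.array([10, 20, 30, 40])   # hypothetical data to filter

mask = flags == 'True'                # elementwise comparison gives a bool array
selected = values[mask]               # boolean-mask indexing keeps the True positions

print(mask.dtype, selected)
```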
Upvotes: 9
Reputation: 97691
Is this good enough?
import numpy as np
my_list = ['True', 'False', 'True', 'False']
# np.array needs a sequence here; a bare generator would not be iterated
np.array([x == 'True' for x in my_list])
It's not native, but if you're starting with a non-native list anyway, it really shouldn't matter.
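One caveat, sketched below as I understand NumPy's behaviour: np.array does not consume a bare generator expression but wraps the generator object itself, so the comparison has to be passed as a list or through np.fromiter. The variable names are only for illustration:

```python
import numpy as np

my_list = ['True', 'False', 'True', 'False']

# A generator is treated as a single opaque object: 0-d array of dtype object.
wrapped = np.array(x == 'True' for x in my_list)

# A list comprehension is iterated, giving a 1-d bool array.
from_list = np.array([x == 'True' for x in my_list])

# np.fromiter consumes the generator directly, given an explicit dtype.
from_iter = np.fromiter((x == 'True' for x in my_list), dtype=bool)

print(wrapped.shape, wrapped.dtype)
print(from_list, from_iter)
```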
Upvotes: 0