Newmu

Reputation: 1960

Numpy Convert String Representation of Boolean Array To Boolean Array

Is there a native numpy way to convert an array of string representations of booleans, e.g.:

['True','False','True','False']

into an actual boolean array I can use for masking/indexing? I could loop through and rebuild the array, but for large arrays this is slow.

Upvotes: 8

Views: 9974

Answers (3)

JAB

Reputation: 21089

I've found a method that's even faster than DSM's, taking inspiration from Eric: build the array with np.fromiter so that the truth testing happens during creation of the numpy array rather than after. The improvement is best seen with smaller lists of values; with very large lists, the cost of the iteration itself starts to outweigh that advantage.

I tested with both is and ==. is only works when the strings are interned and would fail for non-interned strings. As 'True' is probably going to be a literal in the script it should be interned, but == is the safe choice. While my version with == was slower than the one with is, it was still much faster than DSM's version.
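
To illustrate the interning caveat, here is a minimal sketch. The names `literal`, `runtime`, `by_equality` and `by_identity` are illustrative and not part of the benchmark. It assumes CPython, where a string assembled at runtime is typically not interned; that is an implementation detail, not a language guarantee. Comparing against a variable rather than a literal also avoids the SyntaxWarning that Python 3.8+ emits for `is` with a literal.

```python
import numpy as np

literal = 'True'
# Built at runtime, so CPython will typically not intern it (implementation detail).
runtime = ''.join(['Tr', 'ue'])

x = [literal, 'False', runtime]

# Equality compares contents, so both 'True' strings match.
by_equality = np.fromiter((e == literal for e in x), dtype=bool)

# Identity compares objects; the runtime-built string may not be the same object.
by_identity = np.fromiter((e is literal for e in x), dtype=bool)

print(by_equality)
print(by_identity)
```

If the strings come from a file or user input rather than from literals in the script, the `==` version is the one to rely on.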

Test setup:

import timeit
def timer(statement, count):
    return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)

>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"

With 1000 items, the faster statements take roughly 65% to 72% of the time DSM's takes:

>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]

For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:

>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]

With 50 items per list, it is a bit over 25% of DSM's:

>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]

For 5 items, about 11% of DSM's:

>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]
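
The timings above use timeit.repeat's default of one million executions per measurement, which is why each figure is in the tens of seconds. For a quicker check, a sketch along these lines may help. `time_statement` is a hypothetical helper, and the `number` and `repeat` values are arbitrary choices, so absolute timings will not match the figures above.

```python
import timeit
from random import choice

import numpy as np


def time_statement(statement, count, number=1000, repeat=3):
    """Time `statement` against a random list `x` of `count` strings."""
    x = [choice(['True', 'False']) for _ in range(count)]
    # Hand the statement its own namespace instead of a setup string.
    namespace = {'np': np, 'x': x}
    return timeit.repeat(statement, globals=namespace,
                         number=number, repeat=repeat)


state_eq = "y = np.fromiter((e == 'True' for e in x), bool)"
state_dsm = "y = np.array(x) == 'True'"

timings_eq = time_statement(state_eq, 100)
timings_dsm = time_statement(state_dsm, 100)

print(min(timings_eq), min(timings_dsm))
```

Comparing the minimum of each list is a common way to reduce noise from other processes.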

Upvotes: 2

DSM

Reputation: 353549

You should be able to do a boolean comparison, IIUC, whether the dtype is a string or object:

>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'], 
      dtype='|S5')
>>> a == "True"
array([ True, False,  True, False], dtype=bool)

or

>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False,  True, False], dtype=bool)
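
Since the question asks about masking/indexing, here is a small sketch of using the comparison result as a mask. The `values` array and its contents are made up for illustration.

```python
import numpy as np

labels = np.array(['True', 'False', 'True', 'False'])
values = np.array([10, 20, 30, 40])  # hypothetical data to be filtered

# Elementwise comparison produces a boolean array of the same shape.
mask = labels == 'True'

# The boolean array can index another array of matching length.
selected = values[mask]
print(selected)
```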

Upvotes: 9

Eric

Reputation: 97691

Is this good enough?

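import numpy as np
my_list = ['True', 'False', 'True', 'False']
# np.array() would wrap a bare generator in a 0-d object array; np.fromiter consumes it
np.fromiter((x == 'True' for x in my_list), dtype=bool)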

It's not native, but if you're starting with a non-native list anyway, it really shouldn't matter.
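
One detail worth checking when building the array from a Python iterable is how NumPy treats a generator expression compared with a list. The sketch below shows the three variants side by side; the variable names are illustrative.

```python
import numpy as np

my_list = ['True', 'False', 'True', 'False']

# A generator passed to np.array is stored as a single object, not iterated.
wrapped = np.array(x == 'True' for x in my_list)

# A list comprehension is materialised first, so np.array sees the booleans.
from_list = np.array([x == 'True' for x in my_list])

# np.fromiter consumes the generator directly into a 1-d array of the given dtype.
from_iter = np.fromiter((x == 'True' for x in my_list), dtype=bool)

print(wrapped.shape, wrapped.dtype)
print(from_list)
print(from_iter)
```

Only the last two variants give an array that can be used as a mask.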

Upvotes: 0
