Reputation: 294298
Consider the arrays below and the test results.
Why is running with tolist() quicker when the array elements are strings, but not when they are floats?
from string import ascii_letters
import numpy as np
import pandas as pd
bunch_of_strings = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 4))).sum(1).values
bunch_of_floats = np.random.rand(10000)
Upvotes: 0
Views: 710
Reputation: 280898
Your bunch_of_strings array has object dtype, meaning it gets none of the benefits of NumPy. It's basically just a worse list with a bunch of NumPy-specific overhead and a fixed size.
When you call tolist, np.unique has to convert the list back into an array. When it does, it makes an array with a fixed-width string dtype (dtype('S4') on Python 2; the Python 3 equivalent is dtype('<U4')). The benefits of a non-object dtype save a lot of time in the np.unique call, more than is lost in the extra conversions.
In contrast, bunch_of_floats has float64 dtype, and the array -> list -> array conversion doesn't change that. It just wastes time.
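To see the dtype change for yourself, here is a minimal sketch. The small arrays and their names (strings_obj, strings_fixed, floats, floats_roundtrip) are made-up stand-ins, not the arrays from the question, and it assumes Python 3, where the inferred string dtype is a unicode one rather than 'S4'.

```python
import numpy as np

# Hypothetical small stand-ins for the arrays in the question.
strings_obj = np.array(["abcd", "efgh", "abcd"], dtype=object)
floats = np.array([0.5, 0.25, 0.5])

# Going array -> list -> array lets NumPy infer the dtype again.
strings_fixed = np.array(strings_obj.tolist())
floats_roundtrip = np.array(floats.tolist())

print(strings_obj.dtype)         # object dtype: an array of Python objects
print(strings_fixed.dtype.kind)  # 'U': fixed-width unicode string dtype
print(floats_roundtrip.dtype)    # still float64, so nothing was gained

# Both string arrays hold the same unique values; only the dtype differs.
print(np.unique(strings_obj).tolist())
print(np.unique(strings_fixed).tolist())
```

If that is the cause, converting once up front (for example with astype(str)) and reusing the result should avoid repeating the conversion on every call, though you would want to time it on your own data.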
Upvotes: 6