piRSquared

Reputation: 294298

Why is np.unique(array.tolist()) quicker than np.unique(array) when array is type string and the opposite is true for floats?

Consider the arrays below and the test results.
Why is running with tolist() quicker when array elements are strings and not quicker when array elements are floats?

from string import ascii_letters
import numpy as np
import pandas as pd
bunch_of_strings = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 4))).sum(1).values
bunch_of_floats = np.random.rand(10000)

[Image: test results]

Upvotes: 0

Views: 710

Answers (1)

user2357112

Reputation: 280898

Your bunch_of_strings array has object dtype, meaning it gets none of the benefits of NumPy. It's basically just a worse list with a bunch of NumPy-specific overhead and a fixed size.
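A quick way to check the object dtype claim is to inspect the array. This is a minimal sketch: it builds a stand-in for bunch_of_strings with NumPy alone (an assumption made here so the snippet does not depend on pandas), and the names rng and letters are only illustrative.

```python
from string import ascii_letters

import numpy as np

# Stand-in for the question's array: 10000 four-letter strings
# stored as Python str objects inside an object-dtype array.
rng = np.random.default_rng(0)
letters = rng.choice(list(ascii_letters), size=(10000, 4))
bunch_of_strings = np.array(["".join(row) for row in letters], dtype=object)

print(bunch_of_strings.dtype)     # object
print(type(bunch_of_strings[0]))  # <class 'str'>
```

Each element is an ordinary Python object, so operations such as sorting go through Python-level comparisons rather than NumPy's typed loops.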

When you call tolist, np.unique has to convert the list back into an array. When it does, it makes an array with a fixed-width string dtype (dtype('S4') on Python 2, dtype('<U4') on Python 3) instead of object dtype. The benefits of a non-object dtype save a lot of time in the np.unique call, more than is lost in the extra conversions.
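The dtype change caused by the list round trip can be seen on a tiny example. This is a sketch with made-up sample values; obj_arr and roundtrip are hypothetical names, and on Python 3 the string dtype shows up as <U4.

```python
import numpy as np

# Small object-dtype array of 4-character strings (sample values only).
obj_arr = np.array(["abcd", "wxyz", "abcd"], dtype=object)

# np.unique does the equivalent of this conversion when handed a list.
roundtrip = np.array(obj_arr.tolist())

print(obj_arr.dtype)    # object
print(roundtrip.dtype)  # fixed-width string dtype, <U4 on Python 3

# Both paths find the same unique values; only the dtype differs.
same_values = np.unique(obj_arr).tolist() == np.unique(obj_arr.tolist()).tolist()
print(same_values)
```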

In contrast, bunch_of_floats has float64 dtype, and the array->list->array conversion doesn't change that. It just wastes time.
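For the float case, the same check shows that the round trip leaves the dtype unchanged, and a rough timeit comparison can be run alongside it. This is only a sketch: the repetition count of 20 is an arbitrary choice, and the timings will vary by machine and NumPy version.

```python
import timeit

import numpy as np

bunch_of_floats = np.random.rand(10000)

# The array -> list -> array round trip keeps float64.
roundtrip = np.array(bunch_of_floats.tolist())
print(bunch_of_floats.dtype, roundtrip.dtype)  # float64 float64

# Rough timing of both call styles (20 repetitions each).
t_direct = timeit.timeit(lambda: np.unique(bunch_of_floats), number=20)
t_tolist = timeit.timeit(lambda: np.unique(bunch_of_floats.tolist()), number=20)
print(f"direct: {t_direct:.4f}s, via tolist: {t_tolist:.4f}s")
```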

Upvotes: 6
