Why is np.unique(array.tolist()) quicker than np.unique(array) when array is type string and the opposite is true for floats?

Question

Consider the two arrays below.
In timing tests, np.unique(array.tolist()) is quicker than np.unique(array) when the array elements are strings, but slower when the array elements are floats. Why?

from string import ascii_letters
import numpy as np
import pandas as pd
bunch_of_strings = pd.DataFrame(
    np.random.choice(list(ascii_letters), (10000, 4))).sum(1).values
bunch_of_floats = np.random.rand(10000)

user2357112 · Accepted Answer

Your bunch_of_strings array has object dtype, meaning it gets none of the benefits of NumPy. It's basically just a worse list with a bunch of NumPy-specific overhead and a fixed size.

When you call tolist, np.unique has to convert the list back into an array. When it does, it makes an array with a fixed-width string dtype: dtype('S4') on Python 2, or dtype('<U4') on Python 3. The benefits of a non-object dtype save a lot of time in the np.unique call, more than is lost in the extra conversions.
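The dtype change is easy to check directly. The following is a minimal sketch, not the asker's exact setup: it builds the 4-letter strings without pandas and forces object dtype by hand (an assumption that mirrors what `.values` returns in the question), and it uses np.asarray as a stand-in for the list-to-array conversion that np.unique performs on its input.

```python
from string import ascii_letters

import numpy as np

rng = np.random.default_rng(0)

# Build 10000 random 4-letter strings and store them with object dtype,
# mimicking the array in the question (assumption: no pandas needed here).
letters = rng.choice(list(ascii_letters), size=(10000, 4))
strings = ["".join(row) for row in letters]
bunch_of_strings = np.array(strings, dtype=object)

# Stand-in for the conversion np.unique applies to a list argument.
round_tripped = np.asarray(bunch_of_strings.tolist())

print(bunch_of_strings.dtype)  # object dtype: an array of Python str objects
print(round_tripped.dtype)     # fixed-width string dtype (unicode on Python 3)

# Both paths find the same unique values; only the dtype of the result differs.
unique_object = np.unique(bunch_of_strings)
unique_fixed = np.unique(round_tripped)
```

Sorting the object array compares Python str objects one pair at a time, while the fixed-width array is sorted on its raw buffer, which is where the time difference comes from.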

In contrast, bunch_of_floats has float64 dtype, and the array->list->array conversion doesn't change that. It just wastes time.
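For the float case, a short sketch like the one below shows that the round trip leaves the dtype unchanged, and gives a rough way to time both calls. The repetition count of 100 is an arbitrary choice, and the timings depend on the machine and NumPy version, so no expected numbers are shown.

```python
import timeit

import numpy as np

bunch_of_floats = np.random.rand(10000)

# array -> list -> array: the dtype is float64 before and after.
round_tripped = np.asarray(bunch_of_floats.tolist())
print(bunch_of_floats.dtype, round_tripped.dtype)

# Rough timing of both calls (number=100 is an arbitrary choice).
direct = timeit.timeit(lambda: np.unique(bunch_of_floats), number=100)
via_list = timeit.timeit(lambda: np.unique(bunch_of_floats.tolist()), number=100)
print(f"direct: {direct:.4f}s, via tolist: {via_list:.4f}s")
```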
