FancyXun

Reputation: 1298

Why is indexing a NumPy array of strings slower than indexing a NumPy array of objects?

Sample code

import numpy as np
import time


class A:
    def __init__(self, n):
        self.n = n

    def str_n(self):
        return str(self.n)


idx = np.asarray(list(range(30000)))
l_a = []
for i in range(400000):
    l_a.append(A(i))

l_a_arr = np.asarray(l_a)
l_a_str_arr = np.asarray([i.str_n() for i in l_a])


s_time = time.time()
l_a_idx_str_arr = l_a_str_arr[idx].tolist()
cost_time = time.time() - s_time
print("String array cost time is ", cost_time)

s_time = time.time()
l_a_idx_arr = l_a_arr[idx].tolist()
cost_time = time.time() - s_time
print("Class array cost time is ", cost_time)

The logs:

String array cost time is 0.0014674663543701172
Class array cost time is 0.0003917217254638672

UPDATE
Repeating each measurement 1000 times, both with and without tolist():

import numpy as np
import time


class A:
    def __init__(self, n):
        self.inner_n = n + 111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

    def str_n(self):
        return str(self.inner_n)


idx = np.asarray(list(range(30000)))
l_a = []
for i in range(400000):
    l_a.append(A(i))

l_a_arr = np.asarray(l_a)
l_a_str_arr = np.asarray([i.str_n() for i in l_a])

avg_time = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_str_arr = l_a_str_arr[idx].tolist()
    cost_time = time.time() - s_time
    avg_time.append(cost_time)
print("String array 1000 average cost time with tolist is ", np.average(avg_time))

avg_time1 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_arr = l_a_arr[idx].tolist()
    cost_time = time.time() - s_time
    avg_time1.append(cost_time)
print("Class array 1000 average cost time with tolist is ", np.average(avg_time1))

avg_time2 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_str_arr = l_a_str_arr[idx]
    cost_time = time.time() - s_time
    avg_time2.append(cost_time)
print("String array 1000 average cost time is ", np.average(avg_time2))

avg_time3 = []
for i in range(1000):
    s_time = time.time()
    l_a_idx_arr = l_a_arr[idx]
    cost_time = time.time() - s_time
    avg_time3.append(cost_time)
print("Class array 1000 average cost time is ", np.average(avg_time3))

The logs:

String array 1000 average cost time with tolist is 0.0037294850349426267
Class array 1000 average cost time with tolist is 0.00030662870407104493
String array 1000 average cost time is 0.0014972503185272216
Class array 1000 average cost time is 0.0001489844322204589

An array of strings is just a special case of an array of objects, so why does indexing it take more time?

Upvotes: 0

Views: 152

Answers (1)

hpaulj

Reputation: 231385

Object dtype arrays are like lists, storing references to objects. Indexing is nearly as fast as with lists.
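A minimal sketch of that point, assuming a hypothetical `Item` class as a stand-in for the question's `A`: indexing an object array only copies references, so the selected elements are the very same Python objects that were put in.

```python
import numpy as np


class Item:  # hypothetical stand-in for the question's class A
    def __init__(self, n):
        self.n = n


items = [Item(i) for i in range(5)]
obj_arr = np.empty(len(items), dtype=object)
obj_arr[:] = items  # each slot now holds a reference to an existing object

positions = [0, 2, 4]
picked = obj_arr[np.array(positions)]

# The selection contains the original objects themselves, not copies.
same_objects = all(p is items[i] for p, i in zip(picked, positions))
print(obj_arr.dtype, obj_arr.itemsize, same_objects)
```

Here `itemsize` is the size of one reference (a pointer), regardless of how large the referenced object is.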

String dtype arrays store the characters themselves in fixed-width slots inside the array buffer, just as numeric arrays store numbers. Indexing individual elements is slower because each one has to be converted from NumPy's raw bytes into a Python string ('unboxing').
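The difference in storage can be inspected directly. The sketch below is only an illustration: it assumes 96-digit strings, chosen to resemble the long numbers in the question's update, and uses hypothetical variable names. It compares how many bytes an indexed selection has to hold in each case and shows that elements taken from the string array are new objects built from the buffer.

```python
import numpy as np

# 96-digit strings, loosely mirroring the question's updated example.
strings = [str(10**95 + i) for i in range(1000)]

str_arr = np.asarray(strings)                # fixed-width unicode dtype
obj_arr = np.asarray(strings, dtype=object)  # array of references

idx = np.arange(100)

# Fancy indexing copies `itemsize` bytes per selected element.
str_bytes = str_arr[idx].nbytes   # 100 * (4 bytes per character * 96)
obj_bytes = obj_arr[idx].nbytes   # 100 * size of one reference

elem = str_arr[10]                # a NumPy string scalar built from the buffer
as_list = str_arr[idx].tolist()   # plain Python str objects, created on demand

print(str_arr.dtype, str_arr.itemsize, str_bytes)
print(obj_arr.dtype, obj_arr.itemsize, obj_bytes)
print(type(elem).__name__, type(as_list[0]).__name__)
```

With strings this long, the string-array selection moves far more memory than the object-array selection, and `tolist()` additionally has to build a fresh Python string for every element.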

Arrays are best used 'whole' rather than iteratively.
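As a rough illustration of 'whole' versus iterative use (a sketch with made-up data, not a benchmark): a single comparison against the entire string array runs inside NumPy, while a Python loop has to unbox every element before comparing it.

```python
import numpy as np

str_arr = np.asarray([str(i) for i in range(1000)])

# Whole-array use: one vectorized comparison over the raw buffer.
mask_whole = str_arr == "500"

# Iterative use: every element is first turned into a Python-level scalar.
mask_loop = np.array([s == "500" for s in str_arr])

print(mask_whole.sum(), np.array_equal(mask_whole, mask_loop))
```

Both approaches give the same mask; the difference lies in where the per-element work happens.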

Upvotes: 1
