Tanvi Mirza

Reputation: 863

Optimize value retrieval from a large dictionary

I have a dictionary id_to_phone. It contains around 350,000 unique ids (dictionary keys), each representing a unique phone number (dictionary value). My requirement is to get the phone number for the ids generated by my code. My code generates around 10,000 to 50,000 ids, and for each id I need to find the matching phone number and store it in an array. I used the following code:

import numpy as np

count = phone_id.shape[0]
phone_array = np.array([])
for i in range(count):
    phone = id_to_phone[phone_id[i]]
    phone_array = np.append(phone_array, phone)

But this code is taking a very long time. Is there any way to optimize it?

Upvotes: 2

Views: 50

Answers (1)

Matti Lyra

Reputation: 13088

Your problem is not the dictionary lookup but np.append. NumPy arrays are fixed-size, contiguous blocks of memory; appending beyond the current size forces the whole block to be reallocated and copied somewhere else, which takes a long time. A few appends don't matter, but doing lots of them will keep growing the array past what was originally allocated. In fact it is worse than that. From the docs:

Return: A copy of arr with values appended to axis. Note that append does not occur in-place: a new array is allocated and filled. If axis is None, out is a flattened array.

So every call to np.append copies the entire array; no wonder it takes a long time.
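For instance, the returned array is a separate object and the original is left untouched:

import numpy as np

a = np.array([1, 2, 3])
b = np.append(a, 4)   # b is a freshly allocated array
print(a)              # [1 2 3]  (unchanged)
print(b)              # [1 2 3 4]
print(b is a)         # False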

Use a regular Python list instead; appending to a list is amortized constant time.

import timeit

import numpy as np

def np_append():
    arr = np.asarray([])
    for i in range(5000):
        # np.append returns a freshly allocated copy on every call
        # (the result is discarded here, but the allocation and copy still happen)
        np.append(arr, i)

def list_append():
    ls = []
    for i in range(5000):
        # list.append is amortized O(1) and mutates the list in place
        ls.append(i)

if __name__ == "__main__":
    print(timeit.repeat('np_append()', number=10, repeat=3, globals=globals()))
    print(timeit.repeat('list_append()', number=10, repeat=3, globals=globals()))

The timings are as follows:

np_append : [0.15639284392818809, 0.15938732610084116, 0.15667122812010348]
list_append : [0.003160736057907343, 0.004024225985631347, 0.003376785898581147]
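Applied to the lookup in the question (a minimal sketch, assuming phone_id and id_to_phone as defined there), you can build a plain list and convert it to an array once at the end if you need one:

import numpy as np

# one cheap list append per id, then a single conversion at the end
phones = [id_to_phone[pid] for pid in phone_id]
phone_array = np.array(phones)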

Alternatively, if you know how many elements will be added, you can preallocate that much space in the numpy array up front, e.g. phone_array = np.zeros((15000, 10)) for 15,000 10-digit phone numbers.
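A sketch of that preallocation approach applied to the question's lookup (assuming the phone numbers are stored as integers; use a string or object dtype otherwise):

import numpy as np

count = phone_id.shape[0]
# allocate the full result once, then fill it in place by index
phone_array = np.empty(count, dtype=np.int64)
for i in range(count):
    phone_array[i] = id_to_phone[phone_id[i]]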

Upvotes: 4
