tj_4320

Reputation: 47

How to convert 2D arrays in dictionary into one single array?

I have the following code:

import random
import numpy as np
import pandas as pd

num_seq = 100
len_seq = 20
nts = 4
sequences = np.random.choice(nts, size = (num_seq, len_seq), replace=True)
sequences = np.unique(sequences, axis=0) #sorts the sequences

d = {}
pr = 5

for i in range(num_seq):
    globals()['seq_' + str(i)] = np.tile(sequences[i,:],(pr,1))
    d['seq_' + str(i)] = np.tile(sequences[i,:],(pr,1))

pool = np.empty((0,len_seq),dtype=int)
for i in range(num_seq):
    pool = np.concatenate((pool,eval('seq_' +str(i))))

I want to convert the dictionary d into a single NumPy array (or a dictionary with just one entry). My code works and produces pool, but for larger values of num_seq, len_seq and pr it takes a very long time.

Execution time is critical here, hence my question: is there a more efficient way of doing this?

Upvotes: 3

Views: 613

Answers (1)

Jérôme Richard

Reputation: 50826

Here is a list of important points:

  • np.concatenate runs in O(n), so your second loop runs in O(n^2) time. You can instead append the arrays to a Python list and np.vstack them all at the end (in O(n) time), as sketched just after this list.
  • Accessing globals() is slow and is a known bad practice (it can easily break your code in nasty ways).
  • Calling eval(...) is slow too, and also unsafe, so avoid it.
  • The default CPython interpreter does not optimize duplicated expressions (it recomputes them).
  • You can use Cython or Numba to slightly speed up the code (note that support for dictionaries is experimental in Numba).
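For the first three points, here is a minimal sketch that replaces both loops at once, reusing the names from the question (it assumes the individual seq_i globals are not needed elsewhere):

# Build the dictionary and collect the tiled blocks in a plain Python list,
# avoiding globals()/eval() entirely.
blocks = []
d = {}
for i in range(num_seq):
    tiled = np.tile(sequences[i, :], (pr, 1))
    d['seq_' + str(i)] = tiled   # keep the dictionary if it is still needed
    blocks.append(tiled)         # list append is O(1) amortized

pool = np.vstack(blocks)         # one O(n) stacking instead of repeated concatenation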

Here is an example of faster code (as a replacement for the second loop only):

pool = np.vstack([d[f'seq_{i}'] for i in range(num_seq)])
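Assuming d is only ever filled by that first loop (so its insertion order matches i), the same result can also be obtained directly from the dictionary values, since Python 3.7+ dictionaries preserve insertion order:

pool = np.vstack(list(d.values()))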

Upvotes: 2
