Reputation: 803
I want to create a dataset B by processing a dataset A. To do this, every column in A (~2 million) has to be processed batch-wise (put through a neural network), resulting in 3 outputs which are stacked together and then stored, e.g. in a numpy array.
My code looks like the following, which does not seem to be the best solution:
# Load data
data = get_data()
# Storage for B
B = np.empty(shape=data.shape)
# Process data
for idx, data_B in enumerate(data):
# Process data
a, b, c = model(data_B)
# Reshape and feed in B
B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))
I am looking for ideas to speed up the stacking or assigning step. I do not know whether parallel processing is possible, since everything finally has to be stored in the same array (the ordering is not important). Is there any Python framework I can use?
Loading the data takes 29s (only done once), and stacking and assigning take 20s for a batch size of only 2. The model call takes <1s, allocating the array takes 5s, and all other parts take <1s.
Upvotes: 0
Views: 85
Reputation: 231385
Your arrays' shapes, and especially their number of dimensions, are unclear; I can make a few guesses from what works in the code. Your times suggest that things are very large, so memory management may be a big issue. Creating large temporary arrays takes time.
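As a minimal illustration (with made-up shapes, not yours), repeatedly building a new array via concatenate is far slower than filling a preallocated one, because each concatenate allocates a fresh temporary:

import timeit
import numpy as np

# Made-up shapes, just to illustrate the cost of temporary arrays.
chunks = [np.ones((2, 300)) for _ in range(1000)]

def grow_by_concatenate():
    # Every np.concatenate call allocates a fresh temporary array.
    out = np.empty((0, 300))
    for c in chunks:
        out = np.concatenate((out, c), axis=0)
    return out

def fill_preallocated():
    # One allocation up front, then cheap slice assignments.
    out = np.empty((2000, 300))
    for i, c in enumerate(chunks):
        out[2 * i:2 * (i + 1)] = c
    return out

print(timeit.timeit(grow_by_concatenate, number=5))
print(timeit.timeit(fill_preallocated, number=5))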
What is data.shape? Probably at least 2d; B has the same shape:
B = np.empty(shape=data.shape)
Now you iterate on the 1st dimension of data; let's call those rows, though they might be 2d or larger:
# Process data
for idx, data_B in enumerate(data):
# Process data
a, b, c = model(data_B)
What is the nature of a, b, and c? I'm assuming arrays, with a shape similar to data_B's, but that's just a guess.
# Reshape and feed in B
B[idx * batch_size:batch_size * (idx + 1)] = np.squeeze(np.concatenate((a, b, c), axis=1))
For concatenate to work, a, b, c must be (at least) 2d. Let's guess they are all (n, m); the result is then (n, 3m). Why the squeeze? Is the shape (1, 3m)?
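A quick toy check of those shapes, assuming (1, 4) pieces:

import numpy as np

# Toy stand-ins for a, b, c, each assumed to be (n, m) = (1, 4).
a = np.zeros((1, 4))
b = np.ones((1, 4))
c = np.full((1, 4), 2.0)

joined = np.concatenate((a, b, c), axis=1)
print(joined.shape)              # (1, 12), i.e. (n, 3m)
print(np.squeeze(joined).shape)  # (12,) - squeeze drops the size-1 axis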
I don't know batch_size, but with anything other than 1 I don't think this works. B[idx:idx+1, :] = ... works, since idx ranges over B.shape[0]; with other values it would produce an error.
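A toy reproduction of that error, with made-up shapes:

import numpy as np

# B matches data in shape, as in the question: 4 "rows" here, batch_size 2.
B = np.empty((4, 12))
batch_size = 2
chunk = np.ones((batch_size, 12))  # stand-in for the squeezed concatenate result

B[0 * batch_size:batch_size * 1] = chunk  # fills B[0:2], fine
B[1 * batch_size:batch_size * 2] = chunk  # fills B[2:4], fine
# But enumerate(data) runs idx all the way up to 3, so the slices outrun B:
B[2 * batch_size:batch_size * 3] = chunk  # B[4:6] is empty -> ValueError (broadcast mismatch)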
With this batch_size slice indexing it almost looks like you are trying to string out the iteration values into a long 1d array, batch_size values per iteration. But that doesn't fit with B matching data in shape.
That puzzle aside, I wonder if you really need the concatenate. Could you initialize B so that you can assign values directly, e.g.
B[idx, 0, ...] = a
B[idx, 1, ...] = b
B[idx, 2, ...] = c
Reshaping an array after filling is trivial, and even transposing axes isn't too time consuming; a minimal sketch is below.
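Here's that idea end to end, with a placeholder model() and made-up sizes; a single transpose + reshape at the end recovers the same layout the per-iteration concatenate produced:

import numpy as np

rng = np.random.default_rng()
n_batches, batch_size, m = 1000, 2, 4

def model(idx):
    # Placeholder for the real network: three (batch_size, m) outputs.
    return tuple(rng.random((batch_size, m)) for _ in range(3))

# Allocate once, with an extra axis for the three outputs.
B = np.empty((n_batches, 3, batch_size, m))
for idx in range(n_batches):
    a, b, c = model(idx)
    B[idx, 0] = a        # direct assignment, no temporary concatenate result
    B[idx, 1] = b
    B[idx, 2] = c

# One transpose + reshape gives one (batch_size, 3m) block per batch,
# stacked into a 2d array - the layout np.concatenate(..., axis=1) produced:
out = B.transpose(0, 2, 1, 3).reshape(n_batches * batch_size, 3 * m)
print(out.shape)         # (2000, 12)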
Upvotes: 1