Reputation: 1594
I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
That means I need to pad every sublist shorter than the longest one with a filler value (-1) in order to create a rectangular array. For my current dataset, the longest sublist has length 5037.
My conversion code is below:
for seq in new_format:
    for i in range(0, length - len(seq)):
        seq.append(-1)
However, when there are 300,000 sequences in new_format, and length - len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?
Upvotes: 0
Views: 46
Reputation: 155363
Individual append calls can be rather slow, so use list multiplication to create the whole filler list at once, then concatenate it all at once, e.g.:
for seq in new_format:
    seq += [-1] * (length - len(seq))
seq.extend([-1] * (length - len(seq))) would be equivalent (trivially slower due to the generalized method call approach, but likely unnoticeable given the size of the real work).
In theory, seq.extend(itertools.repeat(-1, length - len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the iterable is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
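A runnable sketch of the list-multiplication approach, with a small toy list standing in for the real 300,000-row dataset (new_format and length are the names from the question):

```python
# Toy stand-in for the real ragged 2-D list.
new_format = [[1, 2, 3], [4], [5, 6]]

# Length of the longest sublist (5037 in the question's dataset).
length = max(len(seq) for seq in new_format)

for seq in new_format:
    # One allocation of the filler and one in-place concatenation per row,
    # instead of thousands of individual append calls.
    seq += [-1] * (length - len(seq))

print(new_format)  # every row is now length 3, padded with -1
```

The result is a rectangular list of lists, ready for conversion to a tensor.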
Upvotes: 1