Evan Weissburg

Reputation: 1594

How can I efficiently pad a list's sublists to the length of its longest sublist?

I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.

That means that I need to fill all sublists smaller than the longest sublist with filler (-1) in order to create a rectangular array. For my current dataset, the longest sublist is of length 5037.

My conversion code is below:

for seq in new_format:
    for i in range(0, length-len(seq)):
        seq.append(-1)

However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?

Upvotes: 0

Views: 46

Answers (1)

ShadowRanger

Reputation: 155363

Individual append calls are comparatively slow, so build the whole filler with list multiplication, then concatenate it in a single operation, e.g.:

for seq in new_format:
    seq += [-1] * (length-len(seq))

seq.extend([-1] * (length-len(seq))) would be equivalent (marginally slower because of the general method-call overhead, but the difference is likely unnoticeable next to the real work).
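For completeness, here is the same idea as a small self-contained sketch. The pad_in_place name and the sample data are made up for illustration, and it assumes length should be the length of the longest sublist (computed with max) rather than a value you already have on hand:

```python
def pad_in_place(sequences, filler=-1):
    """Pad every sublist in place to the length of the longest one."""
    # Assumption: the target length is the length of the longest sublist.
    length = max(map(len, sequences), default=0)
    for seq in sequences:
        # Build the filler in one step and concatenate it in one step,
        # instead of calling append once per missing element.
        seq += [filler] * (length - len(seq))
    return length


# Tiny stand-in for the real 300,000-row data.
new_format = [[1, 2, 3], [4], [], [5, 6]]
length = pad_in_place(new_format)
```

Because the sublists are mutated in place, new_format itself is rectangular afterwards; nothing needs to be reassigned.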

In theory, seq.extend(itertools.repeat(-1, length-len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the generator is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
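If you want to try the itertools.repeat variant anyway, a minimal sketch could look like the following. The pad_with_repeat helper and the sample rows are hypothetical, and no speed or memory advantage is claimed here; it only shows that the result is the same:

```python
from itertools import repeat


def pad_with_repeat(sequences, length, filler=-1):
    """Pad each sublist in place to `length` using a lazy repeat iterator."""
    for seq in sequences:
        # repeat(filler, n) yields `filler` n times without building
        # the filler list up front in Python code.
        seq.extend(repeat(filler, length - len(seq)))


rows = [[1, 2], [3], [4, 5, 6]]
pad_with_repeat(rows, 3)
```

Timing both versions on a slice of the real data (e.g. with timeit) is the only reliable way to tell whether it matters for your case.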

Upvotes: 1
