Fnord

Reputation: 5905

Split a large numpy array into separate arrays with a list of grouped indices

Given two arrays: one holding a master dataset, and a second holding a list of grouped indices that reference the master dataset. What is the fastest way to generate new arrays from the given index data?

Here's my current solution for generating two arrays from a list of index pairs:

# Let's make a large point cloud with 1 million entries and a list of random paired indices
import numpy as np
COUNT = 1000000
POINT_CLOUD = np.random.rand(COUNT,3) * 100
INDICES = (np.random.rand(COUNT,2)*COUNT).astype(int)  # (1,10),(233,12),...

# Split into sublists; np.squeeze is needed here because I don't want arrays of single elements.
LIST1 = POINT_CLOUD[np.squeeze(INDICES[:,[0]])]
LIST2 = POINT_CLOUD[np.squeeze(INDICES[:,[1]])]
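
(As an aside, a variant of the same approach, just to show the squeeze is optional rather than to address speed: indexing the column with a plain integer instead of a one-element list already returns a 1-D index array.)

# Equivalent variant without np.squeeze: INDICES[:, 0] is already 1-D,
# so the result still has shape (COUNT, 3).
LIST1 = POINT_CLOUD[INDICES[:, 0]]
LIST2 = POINT_CLOUD[INDICES[:, 1]]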

This works, but it's a little slow, and it only handles two lists. It would be great to have a solution that could tackle index groups of any size (e.g. ((1,2,3,4),(8,4,5,3),...)).

So, something like:

# PSEUDO CODE using quadruple keys
INDICES = (np.random.rand(COUNT,4)*COUNT).astype(int)
SPLIT = POINT_CLOUD[<some pythonic magic>[INDICES]]
SPLIT[0] = np.array([points from INDEX #1])
SPLIT[1] = np.array([points from INDEX #2])
SPLIT[2] = np.array([points from INDEX #3])
SPLIT[3] = np.array([points from INDEX #4])

Upvotes: 0

Views: 447

Answers (1)

YXD

Reputation: 32521

You just have to transpose the index array:

>>> result = POINT_CLOUD[INDICES.T]
>>> np.allclose(result[0], LIST1)
True
>>> np.allclose(result[1], LIST2)
True
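
This works because INDICES.T has shape (2, COUNT), and fancy indexing POINT_CLOUD with an integer array of that shape produces an array of shape (2, COUNT, 3): one (COUNT, 3) block per index column.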

If you know the number of sub-arrays, you can also unpack the result:

>>> result.shape
(2, 1000000, 3)
>>> L1, L2 = result
>>> np.allclose(L1, LIST1)
True
>>> # etc

This works for larger index groups. For the second example in your question:

>>> INDICES = (np.random.rand(COUNT,4)*COUNT).astype(int)
>>> SPLIT = POINT_CLOUD[INDICES.T]
>>> SPLIT.shape
(4, 1000000, 3)
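
For completeness, a minimal wrapper around the same transpose-and-index idea for any group size (my own sketch, not from the original answer; the helper name split_by_index_groups is made up):

import numpy as np

def split_by_index_groups(points, indices):
    # indices has shape (n, k); transposing gives (k, n), and fancy indexing
    # then returns one (n, dims) block per index column, stacked as (k, n, dims).
    return points[np.asarray(indices).T]

# Example with quadruple keys, mirroring the question's setup.
COUNT = 1000000
POINT_CLOUD = np.random.rand(COUNT, 3) * 100
INDICES = (np.random.rand(COUNT, 4) * COUNT).astype(int)
SPLIT = split_by_index_groups(POINT_CLOUD, INDICES)
assert SPLIT.shape == (4, COUNT, 3)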

Upvotes: 1
