Reputation: 470
I have some data captured in a 3-dimensional ndarray, with dimensions: time tick, sample number, and values per sample.
Through a previous operation, I know for each sample the time tick at which it becomes invalid. Where this never occurs, the value is set to -1. Otherwise, it indicates the time tick at which that sample becomes invalid.
What I'd like to be able to do is blank out the rest of the columns, either by setting the columns at and to the right of the invalidating time tick to NaN, or by using some masking or indexing technique that keeps only the data to the left.
I've read about or found references to similar problems involving fancy indexing, slice(), boolean arrays, and masked arrays, but I'm not seeing a way to accomplish my goal.
import numpy as np
# dimensions are timestep, sample, and values per sample. To make it easy, let's
# do 3 time steps, 4 samples, and 2 values per sample.
data = np.array([
    [  # Timestep 0
        [ 1, 2 ],  # Sample 0
        [ 3, 4 ],  # Sample 1
        [ 5, 6 ],  # Sample 2
        [ 7, 8 ],  # Sample 3
    ],
    [  # Timestep 1
        [ 1, 2 ],
        [ 3, 4 ],
        [ 5, 6 ],
        [ 7, 8 ],
    ],
    [  # Timestep 2
        [ 1, 2 ],
        [ 3, 4 ],
        [ 5, 6 ],
        [ 7, 8 ],
    ],
])
Each sample may become invalid at some timestep. If a sample never becomes invalid, its value is -1.
invalid_at = [
    0,   # sample 0 becomes invalid at timestep 0
    2,   # sample 1 becomes invalid at timestep 2
    1,   # sample 2 becomes invalid at timestep 1
    -1,  # sample 3 never becomes invalid
]
If, for example, we replace the invalid values with NaN (shown as n below), the resulting array should look like:
data = np.array([
    [  # Timestep 0
        [ n, n ],  # Sample 0
        [ 3, 4 ],  # Sample 1
        [ 5, 6 ],  # Sample 2
        [ 7, 8 ],  # Sample 3
    ],
    [  # Timestep 1
        [ n, n ],
        [ 3, 4 ],
        [ n, n ],
        [ 7, 8 ],
    ],
    [  # Timestep 2
        [ n, n ],
        [ n, n ],
        [ n, n ],
        [ 7, 8 ],
    ],
])
The chief difficulty I'm running into is that I have the start index for each sample, but I can't find a way to create a slice (using fancy indexing or otherwise) that will let me assign to it.
For example, the following doesn't work:
data[ :, invalid_at:-1, : ] = np.nan
What I was hoping would happen is that the invalid_at array would be evaluated and generate a per-sample slice.
I could do this with a for loop, but I'd prefer to keep it vectorized for speed and later scalability. Any ideas?
Upvotes: 1
Views: 325
Reputation: 114330
There are a couple of possible ways to do this. The main issue is that the index you are trying to apply is ragged.
If the number of samples is small, you can loop over them with relatively little additional overhead. This option is extremely straightforward and uses basic slice indexing, which is generally the fastest kind of indexing since it doesn't require additional copies of the data or mask arrays:
for sample, step in enumerate(invalid_at):
    if step < 0:
        continue
    data[step:, sample, :] = np.nan
If you really need to do this in one step, you can construct a mask and apply it. The array has dimensions (timestep, sample, value), but the mask only needs the first two. You want a condition like "if an element is at timestep t and sample s, and t is greater than or equal to invalid_at[s], set the mask element to True". The condition can be applied to a pair of broadcast arrays: one for the timestep and one for the sample:
trange = np.arange(data.shape[0]).reshape(-1, 1)  # timestep indices, as a column
srange = np.array(invalid_at).reshape(1, -1)      # invalidation step per sample, as a row
srange[srange == -1] = data.shape[0]              # "never invalid" -> past the last timestep
mask = (trange >= srange)                         # broadcasts to shape (timestep, sample)
data[mask, :] = np.nan
Note that the assignment will only work if you explicitly set dtype=float (or np.float64 or similar) for data, since the integer dtype it currently has does not support NaN.
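If the array already exists with an integer dtype, a minimal sketch of the fix is to convert it before masking (astype returns a floating-point copy):

data = data.astype(float)  # NaN can only be stored in a floating-point array
data[mask, :] = np.nan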
Upvotes: 1