Reputation: 135

Get two neighboring non-nan values in numpy array

Let's say I have a numpy array

my_array = np.array([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])

For each nan value, I want to extract the two non-nan values to the left and right of that point (or single value if appropriate). So I would like my output to be something like

output = [[0.3,0.1], [0.3,0.1], [0.3,0.1], [0.1,0.5], [0.5]]

I was thinking of looping through all the values in my_array, then finding those that are nan, but I'm not sure how to do the next part of finding the nearest non-nan values.

Upvotes: 1

Answers (4)

Ali_Sh

Reputation: 2816

As an exercise, I wanted to see how to solve this problem with NumPy alone. After a few hours I reached a solution :), but since I expect it to be less efficient than the pandas approach mentioned by mozway, I didn't optimize the code further (it can be optimized; the if conditions could probably be simplified and merged into other sections):

import numpy as np

my_array = np.array([np.nan, np.nan, 0.2, 0.3, np.nan, np.nan, np.nan, 0.1, 0.7, np.nan, 0.5])

nans = np.isnan(my_array).astype(np.int8)           # [1 1 0 0 1 1 1 0 0 1 0]
zeros = np.where(nans == 0)[0]                      # [ 2  3  7  8 10]
diff_nan = np.diff(nans)                            # [ 0 -1  0  1  0  0 -1  0  1 -1]
start = np.where(diff_nan == 1)[0]                  # [3 8]
end = np.where(diff_nan == -1)[0] + 1               # [ 2  7 10]

mask_start_nan = np.isnan(my_array[0])              # True
mask_end_nan = np.isnan(my_array[-1])               # False

if mask_end_nan: start = start[:-1]                 # [3 8]
if mask_start_nan: end = end[1:]                    # [ 7 10]
inds = np.dstack([start, end]).squeeze()            # [[ 3  7] [ 8 10]]
initial = my_array[inds]                            # [[0.3 0.1] [0.7 0.5]]

repeats = np.diff(np.where(np.concatenate(([nans[0]], nans[:-1] != nans[1:], [True])))[0])[::2]    # [2 3 1]
if mask_end_nan: repeats = repeats[:-1]             # [2 3 1]
if mask_start_nan: repeats = repeats[1:]            # [3 1]

result = np.repeat(initial, repeats, axis=0)        # [[0.3 0.1] [0.3 0.1] [0.3 0.1] [0.7 0.5]]
if mask_end_nan: result = np.array([*result, np.array(my_array[zeros[-1]])], dtype=object)
if mask_start_nan: result = np.array([np.array(my_array[zeros[0]]), *result], dtype=object)

# [array(0.2) array([0.3, 0.1]) array([0.3, 0.1]) array([0.3, 0.1]) array([0.7, 0.5])]

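I don't know whether there is a much simpler NumPy solution; I implemented what came to mind. Note that, as written, a leading or trailing run of NaNs contributes only a single one-element entry rather than one entry per NaN. I believe this code can be greatly improved (I will do so when I find some free time).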
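A more compact NumPy-only sketch of the same idea, using np.searchsorted to locate the nearest non-NaN index on each side of every NaN. The function name neighbors_searchsorted is hypothetical, and the sketch assumes a 1-D float input; the short Python loop at the end is only there because the output is ragged.

```python
import numpy as np

def neighbors_searchsorted(x):
    """Hypothetical helper: left/right non-NaN neighbors of each NaN in a 1-D array."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    nan_idx = np.flatnonzero(mask)   # positions of the NaNs
    val_idx = np.flatnonzero(~mask)  # positions of the real values (sorted)

    # For each NaN, pos is the slot in val_idx of the first value to its right;
    # the value to its left (if any) sits at pos - 1.
    pos = np.searchsorted(val_idx, nan_idx)

    out = []
    for p in pos:
        pair = []
        if p > 0:                    # a value exists to the left
            pair.append(float(x[val_idx[p - 1]]))
        if p < val_idx.size:         # a value exists to the right
            pair.append(float(x[val_idx[p]]))
        out.append(pair)
    return out
```

Unlike the code above, this returns one entry per NaN, so a leading run of two NaNs yields two one-element lists.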

Upvotes: 1

ddejohn

Reputation: 8962

For the sake of education, here is a fairly straightforward algorithm for achieving this result: for each NaN index, find the closest non-NaN index to its left and to its right, then filter out any remaining infs (sides with no neighbor) at the end:

import numpy as np

def get_neighbors(x: np.ndarray) -> list:
    mask = np.isnan(x)
    nan_idxs, *_ = np.where(mask)
    val_idxs, *_ = np.where(~mask)

    neighbors = []
    for nan_idx in nan_idxs:
        L, R = -float("inf"), float("inf")
        for val_idx in val_idxs:
            if val_idx < nan_idx:
                L = max(L, val_idx)
            else:
                R = min(R, val_idx)
        # casting to list isn't strictly necessary, you'll just end up with a list of arrays
        # keep only finite indices; index 0 is a valid left neighbor
        neighbors.append(list(x[[i for i in (L, R) if -float("inf") < i < float("inf")]]))

    return neighbors

Output:

>>> get_neighbors(my_array)
[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]

The nested for loop performs k(n - k) comparisons when k of the n elements of x are NaN, so the worst case is (n / 2)^2 comparisons, i.e. O(n^2), which occurs when exactly half the elements are NaN.
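For comparison, here is a linear-time sketch that avoids the nested loop by sweeping the data once from each end and remembering the last non-NaN value seen. The name get_neighbors_linear is hypothetical, and it assumes a 1-D sequence of floats (a plain list or a NumPy array both work):

```python
import math

def get_neighbors_linear(values):
    """Hypothetical O(n) variant: two sweeps record the nearest non-NaN on each side."""
    n = len(values)
    left = [None] * n   # nearest non-NaN value at or before each position
    right = [None] * n  # nearest non-NaN value at or after each position

    last = None
    for i in range(n):                  # left-to-right sweep
        if not math.isnan(values[i]):
            last = values[i]
        left[i] = last

    last = None
    for i in range(n - 1, -1, -1):      # right-to-left sweep
        if not math.isnan(values[i]):
            last = values[i]
        right[i] = last

    # Collect the neighbors for NaN positions only, dropping missing sides.
    return [
        [v for v in (left[i], right[i]) if v is not None]
        for i in range(n)
        if math.isnan(values[i])
    ]
```

This is essentially a hand-written forward fill and backward fill, so it does the same work as the pandas ffill/bfill approach without the dependency.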

Upvotes: 1

mozway

Reputation: 260490

Using pandas and numpy:

import numpy as np
import pandas as pd
from numpy import nan

s = pd.Series([0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan])
m = s.isna()
a = np.vstack((s.ffill()[m], s.bfill()[m]))
out = a[:,~np.isnan(a).any(0)].T.tolist()

Output:

[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5]]

NB. You can choose to keep or drop the lists containing NaNs.

With NaNs:

out = a.T.tolist()

[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5, nan]]

Alternative that handles the single-element case:

s = pd.Series([0.2, 0.3, nan, nan, nan, 0.1, nan, 0.5, nan])
m = s.isna()

(pd
 .concat((s.ffill()[m], s.bfill()[m]), axis=1)
 .stack()
 .groupby(level=0).agg(list)
 .to_list()
 )

Output:

[[0.3, 0.1], [0.3, 0.1], [0.3, 0.1], [0.1, 0.5], [0.5]]

Upvotes: 3

Marat

Reputation: 15738

Less elegant than @mozway's answer, but the last list only has one element (here arr is a pandas Series holding the data):

pd.DataFrame({
    'left': arr.ffill(),
    'right': arr.bfill()
}).loc[arr.isna()].apply(lambda row: row.dropna().to_list(), axis=1).to_list()
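A self-contained version of the snippet above, for anyone who wants to run it directly. It assumes arr is a pandas Series built from the question's data; the variable name result is only for illustration:

```python
import numpy as np
import pandas as pd

# Assumption: arr holds the question's data as a pandas Series.
arr = pd.Series([0.2, 0.3, np.nan, np.nan, np.nan, 0.1, np.nan, 0.5, np.nan])

result = (
    pd.DataFrame({"left": arr.ffill(), "right": arr.bfill()})
    .loc[arr.isna()]                                    # keep only the NaN positions
    .apply(lambda row: row.dropna().to_list(), axis=1)  # drop the missing side, if any
    .to_list()
)
print(result)
```

The row-wise apply is what makes this slower than a fully vectorized approach on large inputs, but it keeps the ragged last entry without extra handling.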

Upvotes: 3
