Reputation: 4482
I have the following numpy array
import numpy as np
arr = np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
I split this array into parts, where each part contains consecutive runs of the same value, using this question:
def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

consecutive(arr, stepsize=0)
which yields
[array([1, 1, 1]),
array([2, 2, 2]),
array([3, 3]),
array([2, 2, 2]),
array([1, 1, 1]),
array([2, 2])]
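As an aside, the intermediate steps inside `consecutive` can be traced on this exact array (a small sketch, not part of the question):

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])

# np.diff is non-zero exactly where the value changes
breaks = np.where(np.diff(arr) != 0)[0] + 1   # indices where a new run starts
print(breaks.tolist())                        # [3, 6, 8, 11, 14]

# np.split cuts arr at those indices, giving one sub-array per run
parts = np.split(arr, breaks)
print([p.tolist() for p in parts])
# [[1, 1, 1], [2, 2, 2], [3, 3], [2, 2, 2], [1, 1, 1], [2, 2]]
```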
I would like, for every sub-part above whose (unique) element has appeared before, to add 0.0001 * times_of_appearances_before_that to that sub-part.
I tried this:
arr_f = []
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))  # dictionary which will count the times of appearances
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1

# then add the 0.0001 to the elements, starting from the end
arr_ff = []
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
arr_ff = np.concatenate(arr_ff).ravel()

# revert the order back to initial
arr_fff = []
for sub_arr in reversed(consecutive(arr_ff, stepsize=0)):
    arr_fff.append(sub_arr)
arr_fff = np.concatenate(arr_fff).ravel()
arr_fff
which yields
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. ,
2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
which is the correct result. I was wondering if there is a smarter way to do that (avoiding all these loops, etc.).
Upvotes: 2
Views: 717
Reputation: 1249
You could create (value, length) pairs from your split function and then just create your output, like this:
splits = consecutive(arr, stepsize=0)
l2 = np.array(list(map(lambda x: (x[0], len(x)), splits)), dtype=float)
a, b = l2[:, 0], l2[:, 1]
for val in np.unique(a):
    idx = np.where(a == val)[0]
    a[idx] = val + np.arange(len(idx)) * 0.001
out = np.array([val for val, l in zip(a, b) for i in range(int(l))])
output:
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. , 2.001,
2.001, 2.001, 1.001, 1.001, 1.001, 2.002, 2.002])
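As a usage note, the final expansion in the last line could also be done with `np.repeat`; a sketch, with `a` and `b` written out literally as the loop above leaves them for the example `arr` (in the answer `b` is float, hence the `astype(int)`):

```python
import numpy as np

# hypothetical literals matching what a and b hold after the offset loop
a = np.array([1.0, 2.0, 3.0, 2.001, 1.001, 2.002])
b = np.array([3.0, 3.0, 2.0, 3.0, 3.0, 2.0])

out = np.repeat(a, b.astype(int))  # repeat each run value by its run length
print(out)
```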
Upvotes: 1
Reputation: 13316
Your code is quite convoluted, so let's comment it step by step.
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))  # dictionary which will count the times of appearances
# As Ahmed said, no need to use str(l) as key here. l is a better key, and we spare a `str`
times_appeared_dict = dict(zip([l for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))
# Also `np.unique(arr)` is iterable. No need to iterate over list(np.unique(arr))
times_appeared_dict = dict(zip([l for l in np.unique(arr)], [-1]*len(np.unique(arr))))
# Since you use np.unique(arr) 2 times, let's compute it only once
listUniq = np.unique(arr)
times_appeared_dict = dict(zip([l for l in listUniq], [-1]*len(listUniq)))
# [l for l in aList] is just aList
times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))
Now the next for loop
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1

# We have decided to use integers as keys, so let's remove str
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1
# As we did for listUniq, since we use consecutive(arr, stepsize=0) several times, let's compute it once
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1

# np.unique(sub_arr)[0] can't be anything else than sub_arr[0] (and sub_arr can't be empty)
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] = times_appeared_dict[sub_arr[0]] + 1
# Since you just increase times_appeared_dict[sub_arr[0]], it is better to use +=1.
# It avoids computing twice the place where times_appeared_dict[sub_arr[0]] is stored
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] += 1
# The test np.unique(sub_arr) in arr_f_tmp is pointless:
# sub_arr has to be in arr_f_tmp, you put it there just a few lines sooner
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    times_appeared_dict[sub_arr[0]] += 1
# So we don't really need arr_f_tmp
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    times_appeared_dict[sub_arr[0]] += 1
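At this stage the loop is really just counting, for each value, how many runs it has (offset by -1). A minimal sketch of the same bookkeeping with `collections.Counter` (my addition, not part of the original answer):

```python
import numpy as np
from collections import Counter

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])

# one count per run, keyed by the run's value, then shifted by -1
# to match the convention of times_appeared_dict above
counts = Counter(int(sub[0]) for sub in consecutive(arr, stepsize=0))
times_appeared = {k: v - 1 for k, v in counts.items()}
print(times_appeared)  # {1: 1, 2: 2, 3: 0}
```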
Next loop
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
# Again, remove str and use conseq,
# replace np.unique(x)[0] by x[0], as before,
# and, as for +=1 before, use -=1
for sub_arr in reversed(conseq):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
    times_appeared_dict[sub_arr[0]] -= 1
    arr_ff.append(sub_arr_f)
And the last one. All you are doing here is concatenating the content of arr_ff in reverse order:
arr_fff = []
for sub_arr in arr_ff[::-1]:
    arr_fff.extend(sub_arr)
# Since I've used extend and not append, no need for concatenate/ravel. Just turn the list into an ndarray
arr_fff = np.array(arr_fff)
(Note: I've checked that each of these steps improved the timing. Most of the time it is obvious, but for some it was not a priori sure.)
So, all together, this is still your code, with the proposed optimizations (no change at all in the algorithm, just some Python optimization):
def orgopt(data):
    arr_f = []
    listUniq = np.unique(data)
    conseq = consecutive(data, stepsize=0)
    times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))  # dictionary which will count the times of appearances
    for sub_arr in conseq:
        arr_f.append(sub_arr)
        times_appeared_dict[sub_arr[0]] += 1
    # then add the 0.0001 to the elements, starting from the end
    arr_ff = []
    for sub_arr in reversed(conseq):
        sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
        times_appeared_dict[sub_arr[0]] -= 1
        arr_ff.append(sub_arr_f)
    # revert the order back to initial
    arr_fff = []
    for sub_arr in arr_ff[::-1]:
        arr_fff.extend(sub_arr)
    return np.array(arr_fff)
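A quick sanity check (my addition, assuming the same `consecutive` helper) that the optimized version still reproduces the expected output:

```python
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

def orgopt(data):
    arr_f = []
    listUniq = np.unique(data)
    conseq = consecutive(data, stepsize=0)
    times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))
    for sub_arr in conseq:
        arr_f.append(sub_arr)
        times_appeared_dict[sub_arr[0]] += 1
    arr_ff = []
    for sub_arr in reversed(conseq):
        arr_ff.append(sub_arr + 0.0001*times_appeared_dict[sub_arr[0]])
        times_appeared_dict[sub_arr[0]] -= 1
    arr_fff = []
    for sub_arr in arr_ff[::-1]:
        arr_fff.extend(sub_arr)
    return np.array(arr_fff)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(orgopt(arr), expected)
```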
Now, going further than just Python optimization, one could wonder why you would count the number of occurrences forward first, and then backward. It forces you to work in 2 steps, and then to reverse the list. Hence the 3 loops.
You could just count the number of occurrences and do the 0.0001 addition in the same loop, forward.
def forward(data):
    r = []
    d = {}
    for l in consecutive(data, stepsize=0):
        k = d.get(l[0], 0)
        r.extend(l + 0.0001*k)
        d[l[0]] = k + 1
    return np.array(r)
That is still the same principle (using consecutive to get sublists, and then counting in a dictionary), but simpler: we add sublist + 0.0001 × number of previous occurrences as we go.
Note that d.get(key, 0) is just a way to avoid filling the dictionary first, and so avoid computing np.unique: it returns 0 if key is absent, and the value associated with key otherwise. So the dictionary is implicitly initialized to 0.
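A tiny sketch of that `dict.get` pattern in isolation:

```python
# counting occurrences without pre-initializing the dictionary
d = {}
for v in [1, 2, 2, 1, 3]:
    d[v] = d.get(v, 0) + 1  # a missing key reads as 0

print(d)  # {1: 2, 2: 2, 3: 1}
```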
This last version was not a Python optimization, but an algorithm optimization. Yet it is still the same principle. See what we can do by thinking the whole problem over.
You have an array
arr=np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
Let's add a sentinel at the beginning
arrs=np.concatenate(([-1],arr))
So now arrs=[-1,1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2]
The reason I am doing so is because now, if I compute np.diff(arrs), I get an array of differences of the same shape as arr (not 1 less), starting with a non-zero value: [ 2, 0, 0, 1, 0, 0, 1, 0, -1, 0, 0, -1, 0, 0, 1, 0]
So np.diff(arrs)!=0 is a boolean array:
[ True, False, False, True, False, False, True, False, True, False, False, True, False, False, True, False]
And 1*(np.diff(arrs)!=0) is the same but as 0/1:
[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
So simply a 1 everywhere we encounter a new value (different from previous, including first), 0 otherwise
Now, multiply that array by, say, (arr==2), and we keep only the 1s where that new value is 2:
1*(np.diff(arrs)!=0)*(arr==2)
# [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
Now, compute the cumulative sum of that:
np.cumsum(1*(np.diff(arrs)!=0)*(arr==2))
# Now we could lose the `1*` since cumsum of a boolean array gives the same result
# [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
So, what we get is the number of occurrences of runs of 2 so far. We wanted previous occurrences, so let's subtract 1:
np.cumsum((np.diff(arrs)!=0)*(arr==2))-1
# [-1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
If we ignore places where arr is not 2, we have the number of previous occurrences of 2. Let's filter that by multiplying again by arr==2:
(np.cumsum((np.diff(arrs)!=0)*(arr==2))-1)*(arr==2)
# array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2])
If we multiply that by 0.0001, we have what we need to add to the 2 values. Just do that for all values, not just 2:
def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0)*1
    return data + 0.0001*sum((np.cumsum((data==k)*nn)-1)*(data==k) for k in np.unique(data))
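As a sanity check (my addition), the vectorized version can be compared against the expected result on the example array:

```python
import numpy as np

def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0) * 1
    # for each value k: cumsum counts runs of k, -1 shifts to "previous" runs,
    # and the (data == k) mask keeps the correction only at k's positions
    return data + 0.0001 * sum((np.cumsum((data == k) * nn) - 1) * (data == k)
                               for k in np.unique(data))

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(wcumsum(arr), expected)
```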
(Timed on a list of 320 elements. Short lists are less favorable to the numpy version :-). On your 16-element list, it was just the same, or almost so.)
| Method | Timing |
|---|---|
| Your code | 6359 μs |
| Optimized version of your code | 437 μs |
| 1 pass algorithm | 318 μs |
| cumsum | 53 μs |
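Numbers like these will vary by machine; a minimal sketch of how such timings can be reproduced with `timeit` (the 320-element input here is the example array tiled 20 times, which is an assumption about the exact benchmark input):

```python
import timeit
import numpy as np

def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0) * 1
    return data + 0.0001 * sum((np.cumsum((data == k) * nn) - 1) * (data == k)
                               for k in np.unique(data))

# 320-element input: the 16-element example tiled 20 times (assumed benchmark input)
arr = np.tile(np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2]), 20)

t = timeit.timeit(lambda: wcumsum(arr), number=100) / 100
print(f"wcumsum: {t * 1e6:.1f} µs per call")
```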
Upvotes: 3