Reputation: 4482
I have the following numpy array
import numpy as np
arr = np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
I split this array into parts, where each part contains consecutive runs of the same value, using this question:
def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

consecutive(arr, stepsize=0)
which yields
[array([1, 1, 1]),
array([2, 2, 2]),
array([3, 3]),
array([2, 2, 2]),
array([1, 1, 1]),
array([2, 2])]
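As an aside, the intermediate steps inside `consecutive` can be traced on this exact array (a small sketch, not part of the question):

```python
import numpy as np

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])

# np.diff is non-zero exactly where the value changes
breaks = np.where(np.diff(arr) != 0)[0] + 1   # indices where a new run starts
print(breaks.tolist())                        # [3, 6, 8, 11, 14]

# np.split cuts arr at those indices, giving one sub-array per run
parts = np.split(arr, breaks)
print([p.tolist() for p in parts])
# [[1, 1, 1], [2, 2, 2], [3, 3], [2, 2, 2], [1, 1, 1], [2, 2]]
```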
I would like, for every sub-part above whose (unique) element has appeared before, to add 0.0001 * times_of_appearances_before_that to that sub-part.
I tried this:
arr_f = []
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))  # dictionary which will count the times of appearances
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1

# then add the 0.0001 to the elements, starting from the end
arr_ff = []
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
arr_ff = np.concatenate(arr_ff).ravel()

# revert the order back to initial
arr_fff = []
for sub_arr in reversed(consecutive(arr_ff, stepsize=0)):
    arr_fff.append(sub_arr)
arr_fff = np.concatenate(arr_fff).ravel()
arr_fff
which yields
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. ,
2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
which is the correct result. I was wondering if there is a smarter way to do that (avoiding all these loops, etc.).
Upvotes: 2
Views: 717
Reputation: 1249
You could create (value, length) pairs from your split function and then just create your output, like this:
splits = consecutive(arr, stepsize=0)
l2 = np.array(list(map(lambda x: (x[0], len(x)), splits)), dtype=float)
a, b = l2[:, 0], l2[:, 1]
for val in np.unique(a):
    idx = np.where(a == val)[0]
    a[idx] = val + np.arange(len(idx)) * 0.001
out = np.array([val for val, l in zip(a, b) for i in range(int(l))])
output:
array([1. , 1. , 1. , 2. , 2. , 2. , 3. , 3. , 2.001,
2.001, 2.001, 1.001, 1.001, 1.001, 2.002, 2.002])
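As a usage note, the final expansion in the last line could also be done with `np.repeat`; a sketch, with `a` and `b` written out literally as the loop above leaves them for the example `arr` (in the answer `b` is float, hence the `astype(int)`):

```python
import numpy as np

# hypothetical literals matching what a and b hold after the offset loop
a = np.array([1.0, 2.0, 3.0, 2.001, 1.001, 2.002])
b = np.array([3.0, 3.0, 2.0, 3.0, 3.0, 2.0])

out = np.repeat(a, b.astype(int))  # repeat each run value by its run length
print(out)
```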
Upvotes: 1
Reputation: 13316
Your code is quite convoluted, so let's comment it step by step.
times_appeared_dict = dict(zip([str(l) for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))  # dictionary which will count the times of appearances
# As Ahmed said, no need to use str(l) as key here. l is a better key, and we spare a `str`
times_appeared_dict = dict(zip([l for l in list(np.unique(arr))], [-1]*len(list(np.unique(arr)))))
# Also `np.unique(arr)` is iterable. No need to iterate over list(np.unique(arr))
times_appeared_dict = dict(zip([l for l in np.unique(arr)], [-1]*len(np.unique(arr))))
# Since you use np.unique(arr) 2 times, let's compute it only once
listUniq = np.unique(arr)
times_appeared_dict = dict(zip([l for l in listUniq], [-1]*len(listUniq)))
# [l for l in aList] is just aList
times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))
Now the next for loop
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] + 1

# We have decided to use integers as keys, so let's remove str
for sub_arr in consecutive(arr, stepsize=0):
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1
# As we did for listUniq, since we use consecutive(arr, stepsize=0) several times, let's compute it once
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[np.unique(sub_arr)[0]] = times_appeared_dict[np.unique(sub_arr)[0]] + 1

# np.unique(sub_arr)[0] can't be anything else than sub_arr[0] (and sub_arr can't be empty)
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] = times_appeared_dict[sub_arr[0]] + 1
# Since you just increase times_appeared_dict[sub_arr[0]], it is better to use +=1.
# It avoids computing twice the place where times_appeared_dict[sub_arr[0]] is stored
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    if np.unique(sub_arr) in arr_f_tmp:
        times_appeared_dict[sub_arr[0]] += 1
# The test np.unique(sub_arr) in arr_f_tmp is pointless:
# sub_arr has to be in arr_f_tmp, you put it there just a few lines sooner
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    arr_f_tmp = np.concatenate(arr_f).ravel()
    times_appeared_dict[sub_arr[0]] += 1
# So we don't really need arr_f_tmp
conseq = consecutive(arr, stepsize=0)
for sub_arr in conseq:
    arr_f.append(sub_arr)
    times_appeared_dict[sub_arr[0]] += 1
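At this stage the loop is really just counting, for each value, how many runs it has (offset by -1). A minimal sketch of the same bookkeeping with `collections.Counter` (my addition, not part of the original answer):

```python
import numpy as np
from collections import Counter

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])

# one count per run, keyed by the run's value, then shifted by -1
# to match the convention of times_appeared_dict above
counts = Counter(int(sub[0]) for sub in consecutive(arr, stepsize=0))
times_appeared = {k: v - 1 for k, v in counts.items()}
print(times_appeared)  # {1: 1, 2: 2, 3: 0}
```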
Next loop
for sub_arr in reversed(consecutive(arr, stepsize=0)):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[str(np.unique(sub_arr)[0])]
    times_appeared_dict[str(np.unique(sub_arr)[0])] = times_appeared_dict[str(np.unique(sub_arr)[0])] - 1
    arr_ff.append(sub_arr_f)
# Again, remove str and use conseq,
# replace np.unique(x)[0] by x[0], as before,
# and, as for +=1 before, use -=1
for sub_arr in reversed(conseq):
    sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
    times_appeared_dict[sub_arr[0]] -= 1
    arr_ff.append(sub_arr_f)
And the last one. All you are doing here is concatenating the content of arr_ff in reverse order:
arr_fff = []
for sub_arr in arr_ff[::-1]:
    arr_fff.extend(sub_arr)
# Since I've used extend and not append, no need for concatenate/ravel. Just turn the list into an ndarray
arr_fff = np.array(arr_fff)
(Note: I've checked that each of these steps improved the timing. Most of the time it is obvious, but for some it was not a priori sure.)
So, all together, this is still your code, with the proposed optimizations (no change at all in the algorithm, just some Python optimization):
def orgopt(data):
    arr_f = []
    listUniq = np.unique(data)
    conseq = consecutive(data, stepsize=0)
    times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))  # dictionary which will count the times of appearances
    for sub_arr in conseq:
        arr_f.append(sub_arr)
        times_appeared_dict[sub_arr[0]] += 1
    # then add the 0.0001 to the elements, starting from the end
    arr_ff = []
    for sub_arr in reversed(conseq):
        sub_arr_f = sub_arr + 0.0001*times_appeared_dict[sub_arr[0]]
        times_appeared_dict[sub_arr[0]] -= 1
        arr_ff.append(sub_arr_f)
    # revert the order back to initial
    arr_fff = []
    for sub_arr in arr_ff[::-1]:
        arr_fff.extend(sub_arr)
    return np.array(arr_fff)
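A quick sanity check (my addition, assuming the same `consecutive` helper) that the optimized version still reproduces the expected output:

```python
import numpy as np

def consecutive(data, stepsize=1):
    return np.split(data, np.where(np.diff(data) != stepsize)[0] + 1)

def orgopt(data):
    arr_f = []
    listUniq = np.unique(data)
    conseq = consecutive(data, stepsize=0)
    times_appeared_dict = dict(zip(listUniq, [-1]*len(listUniq)))
    for sub_arr in conseq:
        arr_f.append(sub_arr)
        times_appeared_dict[sub_arr[0]] += 1
    arr_ff = []
    for sub_arr in reversed(conseq):
        arr_ff.append(sub_arr + 0.0001*times_appeared_dict[sub_arr[0]])
        times_appeared_dict[sub_arr[0]] -= 1
    arr_fff = []
    for sub_arr in arr_ff[::-1]:
        arr_fff.extend(sub_arr)
    return np.array(arr_fff)

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(orgopt(arr), expected)
```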
Now, going further than just Python optimization, one could wonder why you would count the number of occurrences forward first, and then backward. It forces you to work in 2 steps, and then to reverse the list. Hence the 3 loops.
You could just count the number of occurrences and do the 0.0001 addition in the same loop, forward.
def forward(data):
    r = []
    d = {}
    for l in consecutive(data, stepsize=0):
        k = d.get(l[0], 0)
        r.extend(l + 0.0001*k)
        d[l[0]] = k + 1
    return np.array(r)
That is still the same principle (using consecutive to get sublists, and then counting in a dictionary), but simpler: we add sublist + 0.0001 × number of previous occurrences as we go.
Note that d.get(key, 0) is just a way to avoid filling the dictionary first, and so avoid computing np.unique: it returns 0 if key is absent, and the value associated with key otherwise. So the dictionary is implicitly initialized to 0.
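A tiny sketch of that `dict.get` pattern in isolation:

```python
# counting occurrences without pre-initializing the dictionary
d = {}
for v in [1, 2, 2, 1, 3]:
    d[v] = d.get(v, 0) + 1  # a missing key reads as 0

print(d)  # {1: 2, 2: 2, 3: 1}
```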
This last version was not a Python optimization, but an algorithm optimization. Yet it is still the same principle. See what we can do by thinking the whole problem over.
You have an array
arr=np.array([1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2])
Let's add a sentinel at the beginning
arrs=np.concatenate(([-1],arr))
So now arrs=[-1,1,1,1,2,2,2,3,3,2,2,2,1,1,1,2,2]
The reason I am doing so is because now, if I compute np.diff(arrs), I get an array of differences of the same shape as arr (not 1 less), starting with a non-zero value: [ 2, 0, 0, 1, 0, 0, 1, 0, -1, 0, 0, -1, 0, 0, 1, 0]
So np.diff(arrs)!=0 is a boolean array:
[ True, False, False, True, False, False, True, False, True, False, False, True, False, False, True, False]
And 1*(np.diff(arrs)!=0) is the same but as 0/1:
[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
So simply a 1 everywhere we encounter a new value (different from previous, including first), 0 otherwise
Now, multiply that array by, say, (arr==2), and we keep only the 1s where that new value is 2:
1*(np.diff(arrs)!=0)*(arr==2)
# [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
Now, compute the cumulative sum of that:
np.cumsum(1*(np.diff(arrs)!=0)*(arr==2))
# Now we could lose the `1*` since cumsum of a boolean array gives the same result
# [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3]
So, what we get is the number of occurrences of runs of 2 so far. We wanted previous occurrences, so let's subtract 1:
np.cumsum((np.diff(arrs)!=0)*(arr==2))-1
# [-1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
If we ignore places where arr is not 2, we have the number of previous occurrences of 2. Let's filter that by multiplying again by arr==2:
(np.cumsum((np.diff(arrs)!=0)*(arr==2))-1)*(arr==2)
# array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2])
If we multiply that by 0.0001, we have what we need to add to the 2 values. Just do that for all values, not just 2:
def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0)*1
    return data + 0.0001*sum((np.cumsum((data==k)*nn)-1)*(data==k) for k in np.unique(data))
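As a sanity check (my addition), the vectorized version can be compared against the expected result on the example array:

```python
import numpy as np

def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0) * 1
    # for each value k: cumsum counts runs of k, -1 shifts to "previous" runs,
    # and the (data == k) mask keeps the correction only at k's positions
    return data + 0.0001 * sum((np.cumsum((data == k) * nn) - 1) * (data == k)
                               for k in np.unique(data))

arr = np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2])
expected = np.array([1, 1, 1, 2, 2, 2, 3, 3,
                     2.0001, 2.0001, 2.0001, 1.0001, 1.0001, 1.0001, 2.0002, 2.0002])
assert np.allclose(wcumsum(arr), expected)
```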
(Timed on a list of 320 elements. Short lists are less favorable to the numpy version :-). On your 16-element list, it was just the same, or almost so.)
| Method | Timing |
|---|---|
| Your code | 6359 μs |
| Optimized version of your code | 437 μs |
| 1 pass algorithm | 318 μs |
| cumsum | 53 μs |
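Numbers like these will vary by machine; a minimal sketch of how such timings can be reproduced with `timeit` (the 320-element input here is the example array tiled 20 times, which is an assumption about the exact benchmark input):

```python
import timeit
import numpy as np

def wcumsum(data):
    arrs = np.concatenate(([-1], data))
    nn = (np.diff(arrs) != 0) * 1
    return data + 0.0001 * sum((np.cumsum((data == k) * nn) - 1) * (data == k)
                               for k in np.unique(data))

# 320-element input: the 16-element example tiled 20 times (assumed benchmark input)
arr = np.tile(np.array([1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2]), 20)

t = timeit.timeit(lambda: wcumsum(arr), number=100) / 100
print(f"wcumsum: {t * 1e6:.1f} µs per call")
```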
Upvotes: 3