Reputation: 4148
I have a list/array that looks something like this:
[ 0 1 2 3 4 5 6 7 3 9 10 11 13 13 14 15 16 17 18 19 4 16 22 5 3
2 10 17 34 5 11 18 27 14 11 15 29 2 11 10 19 32 8 27 1 32 6 2 0]
This list is supposed to be monotonic (strictly increasing). It is not, but you can see that it is mostly increasing. The values that do not fit this pattern can be considered noise, and I want them removed. So I want to extract the largest possible subset of this list that forms a strictly increasing sequence of numbers. There are many possible monotonic sequences here, but the point is to find the largest possible one.
It is important that I get the indices of the values to be removed, as I need to know the exact position of the remaining numbers (so instead of removing numbers we can replace them with e.g. None, nan, or -1).
I cannot change the order of any number, just remove the ones that do not fit in.
The remaining list has to be strictly increasing, so if we have e.g. [11 13 13 14], both of the 13s have to be removed.
If there are several possible solutions that are equally large, we cannot use any of them and must choose a solution with 1 number less. E.g. in [27 29 30 34 32] we have to throw away both 34 and 32, because we cannot choose one over the other.
If we have [27 29 34 15 32] there is no possible solution, because we cannot choose between [27 29], [27 34], [29 34], or [15 32].
The best possible solution to the list presented above would be this:
[ 0 1 2 3 4 5 6 7 -1 9 10 11 -1 -1 14 15 16 17 18 19 -1 -1 22 -1 -1
-1 -1 -1 -1 -1 -1 -1 27 -1 -1 -1 29 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
Can anyone think of an algorithm that would do this specific job? If you can get me part of the way, that would also be appreciated.
My only idea so far is a loop for n in range(N, 0, -1): where N is the size of the list. The loop would first try to find solutions of size n=N, and then for n=N-1, n=N-2, etc. When it finds exactly 1 solution for a specific n, it stops and returns that solution. I'm not sure what should be inside the loop yet.
UPDATE:
Another SO question provides a Python algorithm for finding the longest increasing subsequence of a list. This is almost what I want to do, but not quite.
I have copied that function (see below) and added a little extra code at the end which changes the output if fullsize=True. Then the original sequence with its original shape is rebuilt, but the numbers which are not part of the increasing sequence are replaced by nans. I then check if any number occurs more than once, and if so, replace all occurrences of that number with nans.
The original algorithm must still be changed since it does not provide unique solutions.
For example:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32,
18, 19, 20, 16, 35, 35, 33, 32, 1, 35, 13, 5, 32, 8, 35, 29, 19,
35, 19, 28, 32, 18, 31, 13, 3, 32, 33, 35, 31, 0, 21]
print(subsequence(a))
gives
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 32. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan]
Instead of ending with .. 16 32 nan .. it should have ended with ... 16 nan ... nan 31 nan nan 32 33 35 nan nan nan], as far as I can see.
Simpler example:
a = [0,1,2,3,4,1,2,3,4,5]
print(subsequence(a))
gives
[ 0. 1. 2. 3. nan nan nan nan nan 5.]
but it should only have given [0 nan ... nan 5] because 1 2 3 4 appears twice and is not unique.
Here comes the current semi-working version of the code (which was used for my example runs):
import numpy as np


def subsequence(seq, fullsize=True):
    """
    Credit:
    http://stackoverflow.com/questions/3992697/longest-increasing-subsequence
    """
    M = [None] * len(seq)  # offset by 1 (j -> j-1)
    P = [None] * len(seq)

    # Since we have at least one element in our list, we can start by
    # knowing that there's at least an increasing subsequence of length one:
    # the first element.
    L = 1
    M[0] = 0

    # Looping over the sequence starting from the second element
    for i in range(1, len(seq)):
        # Binary search: we want the largest j <= L
        # such that seq[M[j]] < seq[i] (default j = 0),
        # hence we want the lower bound at the end of the search process.
        lower = 0
        upper = L

        # Since the binary search will not look at the upper bound value,
        # we'll have to check that manually
        if seq[M[upper-1]] < seq[i]:
            j = upper
        else:
            # actual binary search loop
            while upper - lower > 1:
                mid = (upper + lower) // 2
                if seq[M[mid-1]] < seq[i]:
                    lower = mid
                else:
                    upper = mid
            j = lower  # this will also set the default value to 0

        P[i] = M[j-1]

        if j == L or seq[i] < seq[M[j]]:
            M[j] = i
            L = max(L, j+1)

    # Building the result: [seq[M[L-1]], seq[P[M[L-1]]], seq[P[P[M[L-1]]]], ...]
    result = []
    pos = M[L-1]
    for _ in range(L):
        result.append(seq[pos])
        pos = P[pos]
    result = np.array(result[::-1])  # reversing

    if not fullsize:
        return result  # Original return from other SO question.

    # This was written by me, PaulMag:
    # Rebuild original sequence
    subseq = np.zeros(len(seq)) * np.nan
    for a in result:
        for i, b in enumerate(seq):
            if a == b:
                subseq[i] = a
            elif b > a:
                break
        if np.sum(subseq[np.where(subseq == a)].size) > 1:  # Remove duplicates.
            subseq[np.where(subseq == a)] = np.nan

    return subseq  # Alternative return made by me, PaulMag.
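
One possible reading of the uniqueness requirement is "keep only the elements that belong to every longest strictly increasing subsequence". The sketch below follows that reading and is not necessarily the rule intended in every example of the question. The function name unambiguous_lis and the fill parameter are hypothetical names chosen for illustration. It uses plain O(N^2) loops rather than the binary-search version above.

```python
def unambiguous_lis(seq, fill=-1):
    """Keep only elements shared by every longest strictly increasing
    subsequence of seq; replace all other positions with `fill`."""
    n = len(seq)
    # ends[i]: length of the longest strictly increasing subsequence ending at i.
    ends = [1] * n
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i]:
                ends[i] = max(ends[i], ends[j] + 1)
    # starts[i]: length of the longest strictly increasing subsequence starting at i.
    starts = [1] * n
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            if seq[i] < seq[j]:
                starts[i] = max(starts[i], starts[j] + 1)
    best = max(ends) if n else 0
    # Index i lies on at least one longest subsequence when both halves add up.
    on_some = [i for i in range(n) if ends[i] + starts[i] - 1 == best]
    # Count how many candidates compete for each position 1..best.
    per_pos = {}
    for i in on_some:
        per_pos[ends[i]] = per_pos.get(ends[i], 0) + 1
    out = [fill] * n
    for i in on_some:
        if per_pos[ends[i]] == 1:  # the only candidate for its position
            out[i] = seq[i]
    return out


print(unambiguous_lis([11, 13, 13, 14]))        # [11, -1, -1, 14]
print(unambiguous_lis([27, 29, 30, 34, 32]))    # [27, 29, 30, -1, -1]
print(unambiguous_lis([0, 1, 2, 3, 4, 1, 2, 3, 4, 5]))
# [0, -1, -1, -1, -1, -1, -1, -1, -1, 5]
```

The fill value -1 is only safe if -1 cannot occur in the data; None or nan could be used instead.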
Upvotes: 0
Views: 740
Reputation: 11968
It's a classical dynamic programming problem.
For every element, store the length of the longest sequence that ends at that element. For the first element the value is 1 (just take that element). For each later element, take the maximum of 1 and 1 + the value assigned to any previous element that is smaller than the current element (strictly smaller, since your sequence has to be strictly increasing).
You can implement this with 2 nested loops (O(N^2)). There are probably some optimizations you can make if your data is really large. Or, knowing that your sequence is mostly good, you could check only the previous X elements.
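
A minimal sketch of the two-loop version, assuming strict comparison because the question asks for a strictly increasing result. The function name lis_lengths is a hypothetical name for illustration.

```python
def lis_lengths(seq):
    """lengths[i] is the length of the longest strictly increasing
    subsequence of seq that ends at index i (O(N^2) time)."""
    lengths = [1] * len(seq)  # every element alone has length 1
    for i in range(len(seq)):
        for j in range(i):
            # Extend the best sequence ending at an earlier, smaller element.
            if seq[j] < seq[i]:
                lengths[i] = max(lengths[i], lengths[j] + 1)
    return lengths


print(lis_lengths([27, 29, 30, 34, 32]))  # [1, 2, 3, 4, 4]
```

The two trailing 4s show the tie from the question: both 34 and 32 can end a longest sequence.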
To fix your data, start at an element holding the maximum assigned value (that is the length of the longest monotonic sequence) and replace everything after it with -1. Then go backward through the list looking for the previous element in the sequence: it should be smaller than the current one, and its assigned value should be one less than the current element's. Any element you pass without a match does not belong and is replaced with -1. When you find a match, take it as the current element and continue backward until you reach an element that was assigned 1 (that is the first one).
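
The backward walk can be sketched as follows, assuming strict comparison and starting from the last element that holds the maximum length (any element with the maximum would do). The name keep_one_lis and the fill parameter are hypothetical. Note that this keeps one longest sequence chosen arbitrarily among ties; it does not address the uniqueness rule from the question.

```python
def keep_one_lis(seq, fill=-1):
    """Return a copy of seq where everything outside one longest
    strictly increasing subsequence is replaced by `fill`."""
    n = len(seq)
    if n == 0:
        return []
    # Forward pass: length of the longest sequence ending at each index.
    lengths = [1] * n
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i]:
                lengths[i] = max(lengths[i], lengths[j] + 1)

    out = [fill] * n
    # Start from the last index that holds the maximum length.
    longest = max(lengths)
    cur = max(i for i in range(n) if lengths[i] == longest)
    out[cur] = seq[cur]
    # Walk backward: the predecessor is a smaller value whose stored
    # length is exactly one less than the current element's.
    for i in range(cur - 1, -1, -1):
        if lengths[i] == lengths[cur] - 1 and seq[i] < seq[cur]:
            out[i] = seq[i]
            cur = i
    return out


print(keep_one_lis([27, 29, 30, 34, 32]))  # [27, 29, 30, -1, 32]
```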
Upvotes: 2