Reputation: 183
I need to implement an algorithm that take a list as an argument i.e.
[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]
and does the following:
check if the small sequences (i.e. the sequences that start with 0 or 5 are of length 0.5 proportion of the big sequence length i.e. the sequence [5,5] is of length 2 which is 0.33 of the big (6's) sequence length - 6
and then merges the the small sequence into the big sequence the output of the above example should be
[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]
I have the following code which seem to be rather complicated and I'm sure can be redesigned more efficiently, and it also have a bug when the input includes repeating sequences i.e.
[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
Code:
def __radiate(self,threegroupslist,gr1,gr2,gr3):
### group by and count
### if chronologically the 2nd centroid seq. is below a threshold (function of the largest sequence)
### merge this sequence into the largest sequence
inputlist = threegroupslist
df = pd.DataFrame(threegroupslist)
df.columns = ['cluster']
groups = df.groupby('cluster').groups
len1=len(groups.get(gr1))
len2=len(groups.get(gr2))
len3=len(groups.get(gr3))
min(len1,len2,len3)/max(len1,len2,len3)
minseq = min(len1,len2,len3)
maxseq = max(len1,len2,len3)
#map(lambda x: min(x), [i for i, x in enumerate(groups)])
#if (minseq/maxseq < 0.5):
# indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
# threegroupslist[indices] = gr1
if (len1 > len3 and len2/len1 < 0.5):
# merge sublist2 into sublist 1
indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
threegroupslist[indices] = gr1
elif (len1 <= len3 and len2/len3 < 0.5):
# merge sublist2 into sublist 3
indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
threegroupslist[indices] = gr3
elif (len3/len2 < 0.5):# len2 is the biggest and len3 is the smallest
# merge sublist3 into sublist 2
indices = [i for i, x in enumerate(threegroupslist) if x == gr3]
threegroupslist[indices] = gr2
elif (len1/len2 < 0.5):# len2 is the biggest and len1 is the smallest:
# merge sublist2 into sublist 1
indices = [i for i, x in enumerate(threegroupslist) if x == gr1]
threegroupslist[indices] = gr2
else:
pass
return threegroupslist
I'm new to python so sorry if I made stupid coding mistakes
Upvotes: 0
Views: 220
Reputation: 76
Does it have to be pandas? First thing that comes to mind is collections.Counter
, a part of the standard library.
We also need the math
module and then the sequence can be defined.
import math as mm
import collections as cl
sequences = [0,0,0,0,0,6,6,6,6,6,6,5,5]
Length of each sub-sequence is obtained using Counter
. Then the sub-sequences are ordered by their length.
collected_sequences = cl.Counter(sequences)
collected_sequences_ordered = collected_sequences.most_common()
most_common_value, most_common_count = collected_sequences_ordered[0]
threshold = math.floor(most_common_count * 0.5)
List comprehensions come in handy, yet again :)
# [fixed comparison]
final_result = [value if times > threshold else most_common_value
for value, times in collected_sequences.items()
for i in range(times)]
I didn't test this on other sequences. I also didn't bother with cases where two (longest) sequences have identical length. If you want to handle such cases in a special way, this routine has to be tweaked, but it shouldn't be too hard.
Update
Let's suppose the order of sequences matters. Then we can do the following
sequences_to_be_merged = [number for number, count in collected_sequences.items()
if count <= threshold]
final_result = [most_common_value if number in sequences_to_be_merged else number
for number in sequences]
However, sequences won't necessarily merge this way. There is only one sequence that is the longest and all smaller sequences are converted but stay in the same place. OP's definition of what should happen is not very precise here.
Upvotes: 1
Reputation: 2274
This is what I can suggest so far:
import operator
def compress(sequence):
"""Packs source sequence into sub-sequences and finds the longest"""
if not sequence: return []
max_length = 0
max_item = None
cur_length = 0
cur_item = sequence[0]
compressed = []
for item in sequence:
if item == cur_item:
cur_length += 1
else:
compressed.append((cur_item, cur_length))
if cur_length >= max_length:
max_length, max_item = cur_length, cur_item
cur_item, cur_length = item, 1
# Forgot to handle the last sequence
compressed.append((cur_item, cur_length))
if cur_length >= max_length:
max_length, max_item = cur_length, cur_item
# End of fixes
return compressed, max_length, max_item
if __name__ == '__main__':
sequence = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
compressed, max_length, max_item = compress(sequence)
print 'original:', sequence
print 'compressed:', compressed
print 'max_length:', max_length, 'max_item:', max_item
print 'uncompressed:', [[item] * length for item, length in compressed] # This is for example only
pre_result = [[max_item if length * 2 <= max_length else item] * length for item, length in compressed]
print 'uncompressed with replace:', pre_result
result = reduce(operator.add, pre_result) # Join all sequences into one list again
print 'result:', result
I'm dumping tons of intermediate information to demonstrate how the processing goes. Resulting output:
original: [1, 1, 1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 3, 3, 7, 7, 7, 1, 1]
compressed: [(1, 4), (2, 2), (3, 2), (6, 8), (4, 5), (3, 2), (7, 3)]
max_length: 8 max_item: 6
uncompressed: [[1, 1, 1, 1], [2, 2], [3, 3], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [3, 3], [7, 7, 7]]
uncompressed with replace: [[6, 6, 6, 6], [6, 6], [6, 6], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [6, 6], [6, 6, 6]]
result: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6]
I'm not sure about your memory constraints, but for now I assume you can afford to create extra lists. If you can't, then the solution will involve a lot more list crawling by index and a lot less list comprehension.
Upvotes: 0