Neuril
Neuril

Reputation: 183

python lists manipulation group by count

I need to implement an algorithm that take a list as an argument i.e.

[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]

and does the following:

  1. check if the small sequences (i.e. the sequences that start with 0 or 5 are of length 0.5 proportion of the big sequence length i.e. the sequence [5,5] is of length 2 which is 0.33 of the big (6's) sequence length - 6

  2. and then merges the the small sequence into the big sequence the output of the above example should be

    [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]

I have the following code which seem to be rather complicated and I'm sure can be redesigned more efficiently, and it also have a bug when the input includes repeating sequences i.e.

[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]

Code:

def __radiate(self,threegroupslist,gr1,gr2,gr3):
        ### group by and count
        ### if chronologically the 2nd centroid seq. is below a threshold (function of the largest sequence)
        ### merge this sequence into the largest sequence
        inputlist = threegroupslist
        df = pd.DataFrame(threegroupslist)
        df.columns = ['cluster']
        groups = df.groupby('cluster').groups
        len1=len(groups.get(gr1))
        len2=len(groups.get(gr2))
        len3=len(groups.get(gr3))
        min(len1,len2,len3)/max(len1,len2,len3)
        minseq = min(len1,len2,len3)
        maxseq = max(len1,len2,len3)
        #map(lambda x: min(x), [i for i, x in enumerate(groups)])
        #if (minseq/maxseq < 0.5):
         #   indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
          #  threegroupslist[indices] = gr1
        if (len1 > len3 and len2/len1 < 0.5):
            # merge sublist2 into sublist 1
            indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
            threegroupslist[indices] = gr1
        elif (len1 <= len3 and len2/len3 < 0.5):
            # merge sublist2 into sublist 3
            indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
            threegroupslist[indices] = gr3
        elif (len3/len2 < 0.5):# len2 is the biggest and len3 is the smallest
            # merge sublist3 into sublist 2
            indices = [i for i, x in enumerate(threegroupslist) if x == gr3]
            threegroupslist[indices] = gr2
        elif (len1/len2 < 0.5):# len2 is the biggest and len1 is the smallest:
            # merge sublist2 into sublist 1
            indices = [i for i, x in enumerate(threegroupslist) if x == gr1]
            threegroupslist[indices] = gr2
        else:
            pass
        return threegroupslist

I'm new to python so sorry if I made stupid coding mistakes

Upvotes: 0

Views: 220

Answers (2)

matjazzz144
matjazzz144

Reputation: 76

Does it have to be pandas? First thing that comes to mind is collections.Counter, a part of the standard library.

We also need the math module and then the sequence can be defined.

import math as mm

import collections as cl

sequences = [0,0,0,0,0,6,6,6,6,6,6,5,5]

Length of each sub-sequence is obtained using Counter. Then the sub-sequences are ordered by their length.

collected_sequences = cl.Counter(sequences)
collected_sequences_ordered = collected_sequences.most_common()
most_common_value, most_common_count = collected_sequences_ordered[0]
threshold = math.floor(most_common_count * 0.5)

List comprehensions come in handy, yet again :)

# [fixed comparison]    
final_result = [value if times > threshold else most_common_value
                for value, times in collected_sequences.items()
                for i in range(times)]

I didn't test this on other sequences. I also didn't bother with cases where two (longest) sequences have identical length. If you want to handle such cases in a special way, this routine has to be tweaked, but it shouldn't be too hard.

Update

Let's suppose the order of sequences matters. Then we can do the following

sequences_to_be_merged = [number for number, count in collected_sequences.items()
                          if count <= threshold]

final_result = [most_common_value if number in sequences_to_be_merged else number
                for number in sequences]

However, sequences won't necessarily merge this way. There is only one sequence that is the longest and all smaller sequences are converted but stay in the same place. OP's definition of what should happen is not very precise here.

Upvotes: 1

Lav
Lav

Reputation: 2274

This is what I can suggest so far:

import operator

def compress(sequence):
    """Packs source sequence into sub-sequences and finds the longest"""
    if not sequence: return []
    max_length = 0
    max_item = None
    cur_length = 0
    cur_item = sequence[0]
    compressed = []
    for item in sequence:
        if item == cur_item:
            cur_length += 1
        else:
            compressed.append((cur_item, cur_length))
            if cur_length >= max_length:
                max_length, max_item = cur_length, cur_item
            cur_item, cur_length = item, 1
    # Forgot to handle the last sequence
    compressed.append((cur_item, cur_length))
    if cur_length >= max_length:
        max_length, max_item = cur_length, cur_item
    # End of fixes
    return compressed, max_length, max_item

if __name__ == '__main__':
    sequence = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
    compressed, max_length, max_item = compress(sequence)

    print 'original:', sequence
    print 'compressed:', compressed
    print 'max_length:', max_length, 'max_item:', max_item
    print 'uncompressed:', [[item] * length for item, length in compressed] # This is for example only

    pre_result = [[max_item if length * 2 <= max_length else item] * length for item, length in compressed]
    print 'uncompressed with replace:', pre_result
    result = reduce(operator.add, pre_result) # Join all sequences into one list again
    print 'result:', result

I'm dumping tons of intermediate information to demonstrate how the processing goes. Resulting output:

original: [1, 1, 1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 3, 3, 7, 7, 7, 1, 1]
compressed: [(1, 4), (2, 2), (3, 2), (6, 8), (4, 5), (3, 2), (7, 3)]
max_length: 8 max_item: 6
uncompressed: [[1, 1, 1, 1], [2, 2], [3, 3], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [3, 3], [7, 7, 7]]
uncompressed with replace: [[6, 6, 6, 6], [6, 6], [6, 6], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [6, 6], [6, 6, 6]]
result: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6]

I'm not sure about your memory constraints, but for now I assume you can afford to create extra lists. If you can't, then the solution will involve a lot more list crawling by index and a lot less list comprehension.

Upvotes: 0

Related Questions