George Orwell
George Orwell

Reputation: 303

Efficiently removing consecutive pair duplicates in Python?

I have a bunch of long lists (millions of elements long) that contain a time value and a temperature value ([time, temperature]). The lists look like this:

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

What I want to do is get rid of consecutive pair duplicates. If a pair of temperatures are repeated consecutively, get rid of these duplicate elements (keeping only one).

That wording may be kind of confusing, so I'll provide an example using mylist:

mylist[0] and mylist[1] are consecutive pairs. Same with mylist[1] and mylist[2], and so on.

Moving on. Now, look at the temperature value from mylist. From mylist[0] all the way to mylist[11], the temperature values are:

72 75 74 75 74 75 79 71 79 71 75 74

In the above temperature values, you can see that a the pairs 75 74 as well as 79 71 are repeated in a consecutive fashion. What I want to do is just keep one pair, and get rid of the duplicates. So, the output I want is:

output = [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

Note: elements [11, 75] and [12, 74] are kept because although they also contain this 75 74 pattern, they aren't repeated consecutively, like from earlier in the list.


To try to solve this problem, I searched up and tried a bunch of things. The closest I got was by creating a solution using a for loop, where I'd check an element and the previous element (index-1), and then I'd check index-2 and index-3, and if they were determined to have temperature repeats, I'd delete two of the elements. Then, I'd repeat this looking in the forward direction (index+1). It kind of worked, but things got super messy and was extremely slow, and it turned my computer into a portbable heater. So, I was wondering if anyone knows how to efficiently and quickly way to get rid of these consecutive duplicate pairs.

Upvotes: 3

Views: 310

Answers (3)

Srikant
Srikant

Reputation: 294

Assuming that I have understood the requirement correctly, the code below does the job.

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

n = len(mylist)
index = 0
output_list = []

# We need at least four elements to check if there is a duplicate pair.
while index + 4 <= n:
    sub_list = mylist[index: index + 4]

    if sub_list[0][1] == sub_list[2][1] and sub_list[1][1] == sub_list[3][1]:
        print('Duplicate found')
        # Drop the second one.
        output_list.append(sub_list[0])
        output_list.append(sub_list[1])
        index += 4
    else:
        # We add only the first element as the there can be a potential duplicate that can be found later on when we consider the next element.
        output_list.append(sub_list[0])
        index += 1

# Append the remaining elements if any exist.
output_list.extend(mylist[index:])


print(output_list)

Upvotes: 3

Andrej Kesely
Andrej Kesely

Reputation: 195543

Using collections.deque:

from collections import deque

mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]

def generate(lst):
    d = deque(maxlen=4)
    for v in lst:
        d.append(v)
        if len(d)==4:
            if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
                d.pop()
                d.pop()
            else:
                yield d.popleft()

    yield from d # yield the rest of deque


out = [*generate(mylist)]
print(out)

Prints:

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

Benchmark (using 10_000_000 elements):

import random
from timeit import timeit

mylist = []
for i in range(10_000_000):
    mylist.append([i, random.randint(50, 100)])

def f1():
    return [*generate(mylist)]

t1 = timeit(lambda: f1(), number=1)
print(t1)

Prints on my machine (AMD 2400G, Python 3.8):

3.2782217629719526

Upvotes: 2

Gilseung Ahn
Gilseung Ahn

Reputation: 2624

Use collection.Counters and numpy.

Try this code.

import numpy as np
from collections import Counter

def remove_consecutive_pair_duplicate(L):
    temperature = np.array(L, dtype = str)[:, 1]
    l = 2 # length of pattern       
    pattern_with_length_l = Counter('-'.join(temperature[i:i+l]) for i in range(len(temperature) - l))

    set_of_patterns = []
    for (key, val) in pattern_with_length_l.items():
        left, right = key.split('-')        
        if val >= 2 and right + '-' + left not in set_of_patterns:
            set_of_patterns.append(key)

    removed_index = []
    for pattern in set_of_patterns:
        matched_index = [[i, i+1] for i in range(len(temperature) - l) if '-'.join(temperature[i:i+2]) == pattern]
        for ind in matched_index[1:]:
            removed_index.append(ind[0])
            removed_index.append(ind[1])

    survived_ind = list(set(list(range(len(L)))) - set(removed_index))
    return np.array(L)[survived_ind].tolist()

print(remove_consecutive_pair_duplicate(mylist))

The result is as follows.

[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]

Upvotes: 1

Related Questions