Reputation: 303
I have a bunch of long lists (millions of elements long) that contain a time value and a temperature value ([time, temperature]
). The lists look like this:
mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
What I want to do is get rid of consecutive pair duplicates. If a pair of temperatures are repeated consecutively, get rid of these duplicate elements (keeping only one).
That wording may be kind of confusing, so I'll provide an example using mylist
:
mylist[0]
and mylist[1]
are consecutive pairs. Same with mylist[1]
and mylist[2]
, and so on.
Moving on. Now, look at the temperature value from mylist
. From mylist[0]
all the way to mylist[11]
, the temperature values are:
72 75 74 75 74 75 79 71 79 71 75 74
In the above temperature values, you can see that a the pairs 75 74
as well as 79 71
are repeated in a consecutive fashion. What I want to do is just keep one pair, and get rid of the duplicates. So, the output I want is:
output = [[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]
Note: elements [11, 75]
and [12, 74]
are kept because although they also contain this 75 74
pattern, they aren't repeated consecutively, like from earlier in the list.
To try to solve this problem, I searched up and tried a bunch of things. The closest I got was by creating a solution using a for
loop, where I'd check an element and the previous element (index-1), and then I'd check index-2 and index-3, and if they were determined to have temperature repeats, I'd delete two of the elements. Then, I'd repeat this looking in the forward direction (index+1). It kind of worked, but things got super messy and was extremely slow, and it turned my computer into a portbable heater. So, I was wondering if anyone knows how to efficiently and quickly way to get rid of these consecutive duplicate pairs.
Upvotes: 3
Views: 310
Reputation: 294
Assuming that I have understood the requirement correctly, the code below does the job.
mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
n = len(mylist)
index = 0
output_list = []
# We need at least four elements to check if there is a duplicate pair.
while index + 4 <= n:
sub_list = mylist[index: index + 4]
if sub_list[0][1] == sub_list[2][1] and sub_list[1][1] == sub_list[3][1]:
print('Duplicate found')
# Drop the second one.
output_list.append(sub_list[0])
output_list.append(sub_list[1])
index += 4
else:
# We add only the first element as the there can be a potential duplicate that can be found later on when we consider the next element.
output_list.append(sub_list[0])
index += 1
# Append the remaining elements if any exist.
output_list.extend(mylist[index:])
print(output_list)
Upvotes: 3
Reputation: 195543
Using collections.deque
:
from collections import deque
mylist = [[1, 72], [2, 75], [3, 74], [4, 75], [5, 74], [6, 75], [7, 79], [8, 71], [9, 79], [10, 71], [11, 75], [12, 74]]
def generate(lst):
d = deque(maxlen=4)
for v in lst:
d.append(v)
if len(d)==4:
if d[0][1] == d[2][1] and d[1][1] == d[3][1]:
d.pop()
d.pop()
else:
yield d.popleft()
yield from d # yield the rest of deque
out = [*generate(mylist)]
print(out)
Prints:
[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]
Benchmark (using 10_000_000 elements):
import random
from timeit import timeit
mylist = []
for i in range(10_000_000):
mylist.append([i, random.randint(50, 100)])
def f1():
return [*generate(mylist)]
t1 = timeit(lambda: f1(), number=1)
print(t1)
Prints on my machine (AMD 2400G, Python 3.8):
3.2782217629719526
Upvotes: 2
Reputation: 2624
Use collection.Counters
and numpy
.
Try this code.
import numpy as np
from collections import Counter
def remove_consecutive_pair_duplicate(L):
temperature = np.array(L, dtype = str)[:, 1]
l = 2 # length of pattern
pattern_with_length_l = Counter('-'.join(temperature[i:i+l]) for i in range(len(temperature) - l))
set_of_patterns = []
for (key, val) in pattern_with_length_l.items():
left, right = key.split('-')
if val >= 2 and right + '-' + left not in set_of_patterns:
set_of_patterns.append(key)
removed_index = []
for pattern in set_of_patterns:
matched_index = [[i, i+1] for i in range(len(temperature) - l) if '-'.join(temperature[i:i+2]) == pattern]
for ind in matched_index[1:]:
removed_index.append(ind[0])
removed_index.append(ind[1])
survived_ind = list(set(list(range(len(L)))) - set(removed_index))
return np.array(L)[survived_ind].tolist()
print(remove_consecutive_pair_duplicate(mylist))
The result is as follows.
[[1, 72], [2, 75], [3, 74], [6, 75], [7, 79], [8, 71], [11, 75], [12, 74]]
Upvotes: 1