I have a file that contains the following data:
0.00006598 0.00006591 0.00006617 0.00006555 0.00006550 0.00006557 0.00006555 0.00006564 0.00006586 0.00006591 0.00006621 0.00006623 0.00006597 0.00006606 0.00006624 0.00006553 0.00006589 0.00006586 0.00006610 0.00006610 0.00006611 0.00006598 0.00006598 0.00006591 0.00006608 0.00006600 0.00006600 0.00006600
The full list contains hundreds of rows.
I want to find the index of the next instance of the same value.
So if I take the first entry in this list, 0.00006598, I'd like to iterate through the list and return the index of the next instance of 0.00006598.
Once it reaches the next instance, it should use the second instance to look for the third, and so on. I'd like to do this for each of the unique values in the list.
I've been able to identify how many instances of each value are in the list using the following:
with open("testdata.txt", "r+") as f:
    lines = f.read().splitlines()
    for num, line in enumerate(lines):
        occurrences = lines.count(line)
        print(str(line) + " " + str(occurrences) + " " + str(num))
My intention is to find the largest difference between the indexes for all the values.
What would be the best approach to do this?
Reputation: 168863
collections.defaultdict to the rescue: gather up each line number per value, then process them. This will work up to Very Large files (or Very Large numbers of distinct values):
from collections import defaultdict
import io
# test data, simulating a file (this could just as well be the open file)
test_data = io.StringIO(
"""
0.00006598
0.00006591
0.00006617
0.00006555
0.00006550
0.00006557
0.00006555
0.00006564
0.00006586
0.00006591
0.00006621
0.00006623
0.00006597
0.00006606
0.00006624
0.00006553
0.00006589
0.00006586
0.00006610
0.00006610
0.00006611
0.00006598
0.00006598
0.00006591
0.00006608
0.00006600
0.00006600
0.00006600
""".strip()
)
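# map each distinct value to the list of line numbers where it appears
occurrences = defaultdict(list)
for lineno, value in enumerate(test_data):
    occurrences[value.strip()].append(lineno)

for value, linenos in occurrences.items():
    # distance between the first and the last occurrence of this value
    largest_diff = max(linenos) - min(linenos)
    if largest_diff:  # skip values that occur only once
        print(value, linenos, largest_diff)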
This prints out, e.g.:
> python so62755020.py
0.00006598 [0, 21, 22] 22
0.00006591 [1, 9, 23] 22
0.00006555 [3, 6] 3
0.00006586 [8, 17] 9
0.00006610 [18, 19] 1
0.00006600 [25, 26, 27] 2
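Note that `max(linenos) - min(linenos)` is the span from the first to the last occurrence. If "largest difference" instead means the widest gap between one occurrence and the next, the same per-value index lists can be walked pairwise. A minimal sketch, assuming one value per line; the helper name `largest_consecutive_gap` and the short sample list are made up for illustration and are not part of the code above:

```python
from collections import defaultdict


def largest_consecutive_gap(values):
    """Map each repeated value to the largest gap between neighbouring occurrences."""
    occurrences = defaultdict(list)
    for index, value in enumerate(values):
        occurrences[value].append(index)

    gaps = {}
    for value, indexes in occurrences.items():
        if len(indexes) > 1:
            # indexes are appended in ascending order, so zip pairs
            # each occurrence with the one that follows it
            gaps[value] = max(b - a for a, b in zip(indexes, indexes[1:]))
    return gaps


# hypothetical sample, shortened from the question's data
sample = [
    "0.00006598",
    "0.00006591",
    "0.00006598",
    "0.00006591",
    "0.00006591",
    "0.00006598",
]
print(largest_consecutive_gap(sample))
```

For the sample, 0.00006598 sits at indexes 0, 2 and 5, so its largest consecutive gap is 3, while its first-to-last span would be 5.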
EDIT: In response to the comment, to get a list sorted by the largest diff:
sorted_occ = sorted(
    (
        (value, max(linenos) - min(linenos))
        for value, linenos in occurrences.items()
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
for value, largest_diff in sorted_occ:
    print(value, largest_diff)
This outputs:
0.00006598 22
0.00006591 22
0.00006586 9
0.00006555 3
0.00006600 2
0.00006610 1
...
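If only the single widest span is needed, sorting everything is more than is required; `max` with a key does it in one pass. A rough sketch, assuming `occurrences` has the same shape as above (value mapped to an ascending list of line numbers); the function name `widest_span` is hypothetical, and the sample dict is copied from the printed output above:

```python
def widest_span(occurrences):
    """Return (value, span) for the value whose first and last
    occurrences are furthest apart."""
    return max(
        (
            # lists are in ascending order, so last minus first is the span
            (value, linenos[-1] - linenos[0])
            for value, linenos in occurrences.items()
        ),
        key=lambda pair: pair[1],
    )


# a few entries taken from the output shown earlier
occurrences = {
    "0.00006598": [0, 21, 22],
    "0.00006555": [3, 6],
    "0.00006617": [2],
}
print(widest_span(occurrences))
```

When several values tie for the widest span, `max` returns the first one it encounters.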