Reputation: 1523
I want to make a histogram of all the intervals between repeated values in a list. I wrote some code that works, but it's using a for loop with if statements. I often find that if one can manage to write a version using clever slicing and/or predefined python (numpy) methods, that one can get much faster Python code than using for loops, but in this case I can't think of any way of doing that. Can anyone suggest a faster or more pythonic way of doing this?
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
values = sorted(set(a)) # get list of which values are in a
# setup the dict to hold the histogram
hist, last_index = {}, {}
for i in values:
hist[i] = {}
last_index[i] = -1 # some default value
# now go through the array and find intervals
for i in range(len(a)):
val = a[i]
if last_index[val] != -1: # do nothing if it's the first time
interval = i - last_index[val]
if interval in hist[val]:
hist[val][interval] += 1
else:
hist[val][interval] = 1
last_index[val] = i
return hist
# example list/array
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
histdict = hist_intervals(a)
print("histdict = ",histdict)
# correct answer for this example
answer = { 1: {3:1, 6:1},
2: {2:1, 5:1},
3: {1:1, 3:1, 6:1},
4: {6:1},
5: {6:1}
}
print("answer = ",answer)
Sample output:
histdict = {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}
answer = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
^ note: I don't care about the ordering in the dict, so this solution is acceptable, but I want to be able to run on really large arrays/lists and I'm suspecting my current method will be slow.
Upvotes: 2
Views: 232
Reputation: 61014
You can eliminate the setup loop by a carefully constructed defaultdict
. Then you're just left with a single scan over the input list, which is as good as it gets. Here I change the resultant defaultdict
back to a regular Dict[int, Dict[int, int]]
, but that's just so it prints nicely.
from collections import defaultdict
def count_intervals(iterable):
# setup
last_seen = {}
hist = defaultdict(lambda: defaultdict(int))
# The actual work
for i, x in enumerate(iterable):
if x in last_seen:
hist[x][i-last_seen[x]] += 1
last_seen[x] = i
return hist
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
hist = count_intervals(a)
for k, v in hist.items():
print(k, dict(v))
# 1 {3: 1, 6: 1}
# 3 {3: 1, 6: 1, 1: 1}
# 2 {5: 1, 2: 1}
# 5 {6: 1}
# 4 {6: 1}
Upvotes: 1
Reputation: 6388
There is an obvious change to make in terms of data structures. instead of using a dictionary of dictionaries for hist
use a defaultdict
of Counter
this lets the code become
from collections import defaultdict, Counter
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
values = sorted(set(a)) # get list of which values are in a
# setup the dict to hold the histogram
hist, last_index = defaultdict(Counter), {}
# now go through the array and find intervals
for i, val in enumerate(a):
if val in last_index
interval = i - last_index[val]
hist[val].update((interval,))
last_index[val] = i
return hist
this will be faster as the if
's are written in C, and will also be cleaner.
Upvotes: 0