Adding ranges to count overlap python

Question

I have a list of ranges. I now would like to compute a dictionary key : value, where key is the number and value is in how many of those ranges the number exists.

A bad way to compute this is:

from collections import defaultdict
my_dict = defaultdict(int)
ranges = [range(-4200,4200), range(-420,420), range(-42,42), range(8,9), range(9,9), range(9,10)]

for singleRange in ranges:
    for number in singleRange:
        my_dict[number] += 1
sort_dict = sorted(my_dict.items(), key=lambda x: x[1], reverse=True)
print(sort_dict)

How would you do this more efficiently?

Rom&#233;o Despr&#233;s · Accepted Answer

Improving on my previous answer, this algorithm solves the problem in O(n + m) where n is the length of the total range and m is the number of sub ranges.

The basic idea is to iterate through the n numbers just once, keeping a counter of the number of ranges the current number belongs to. At each step, we check if we have passed a range start, in which case the counter gets incremented. Conversely, if we have have passed a range stop, the counter gets decremented.

The actual implementation below uses numpy and pandas for all the heavy lifting, so the iterative nature of the algorithm may seem unclear, but it's basically just a vectorized version of what I've described.

Compared to the 600 ms of my previous answer, we're down to 20 ms for 10k ranges on my laptop. Moreover, the memory usage is also O(n + m) here while it was O(nm) there, so much larger n and m become possible. You should probably use this solution instead of the first version.

from collections import defaultdict

import numpy as np
import pandas as pd


# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]


ranges = generate_ranges(10_000)


# Extract boundaries
boundaries = np.array([[range.start, range.stop] for range in ranges])

# Add a +1 offset for range starts and -1 for range stops
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1)
boundaries = boundaries.reshape(-1, 2)

# Compute range counts at each crossing of a range boundary
df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].max()

# Expand to all integers by joining and filling NaN
index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).fillna(method="ffill")

# Finally wrap the result in a defaultdict
d = defaultdict(int, df["count"].astype(int).to_dict())

Adding ranges to count overlap python

Answers (2)

Related Questions