Natan
Natan

Reputation: 1045

Adding ranges to count overlap python

I have a list of ranges. I now would like to compute a dictionary key : value, where key is the number and value is in how many of those ranges the number exists.

A bad way to compute this is:

from collections import defaultdict
my_dict = defaultdict(int)
ranges = [range(-4200,4200), range(-420,420), range(-42,42), range(8,9), range(9,9), range(9,10)]

for singleRange in ranges:
    for number in singleRange:
        my_dict[number] += 1
sort_dict = sorted(my_dict.items(), key=lambda x: x[1], reverse=True)
print(sort_dict)

How would you do this more efficiently?

Upvotes: 2

Views: 324

Answers (2)

Roméo Després
Roméo Després

Reputation: 2163

Improving on my previous answer, this algorithm solves the problem in O(n + m) where n is the length of the total range and m is the number of sub ranges.

The basic idea is to iterate through the n numbers just once, keeping a counter of the number of ranges the current number belongs to. At each step, we check if we have passed a range start, in which case the counter gets incremented. Conversely, if we have have passed a range stop, the counter gets decremented.

The actual implementation below uses numpy and pandas for all the heavy lifting, so the iterative nature of the algorithm may seem unclear, but it's basically just a vectorized version of what I've described.

Compared to the 600 ms of my previous answer, we're down to 20 ms for 10k ranges on my laptop. Moreover, the memory usage is also O(n + m) here while it was O(nm) there, so much larger n and m become possible. You should probably use this solution instead of the first version.

from collections import defaultdict

import numpy as np
import pandas as pd


# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]


ranges = generate_ranges(10_000)


# Extract boundaries
boundaries = np.array([[range.start, range.stop] for range in ranges])

# Add a +1 offset for range starts and -1 for range stops
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1)
boundaries = boundaries.reshape(-1, 2)

# Compute range counts at each crossing of a range boundary
df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].max()

# Expand to all integers by joining and filling NaN
index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).fillna(method="ffill")

# Finally wrap the result in a defaultdict
d = defaultdict(int, df["count"].astype(int).to_dict())

Upvotes: 2

Roméo Després
Roméo Després

Reputation: 2163

Probably something more efficient can be done, but this solution has the advantage of heavily relying on the speed of numpy. For 10k ranges this runs in ~600 ms on my laptop.

from collections import defaultdict

import numpy as np


# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]


ranges = generate_ranges(10_000)


# Extract boundaries
starts, stops = np.array([[range.start, range.stop] for range in ranges]).T

# Set of all numbers we should test
n = np.arange(starts.min(), stops.max() + 1)[:, None]

# Test those numbers
counts = ((n >= starts[None, :]) & (n < stops[None, :])).sum(axis=1)

# Wrap the result into a dict
d = defaultdict(int, dict(zip(n.flatten(), counts)))

Upvotes: 1

Related Questions