kruttikaSW29
kruttikaSW29

Reputation: 1

find frequency of numbers in intervals

What would be the most efficient way to find the frequency/count of elements in non-overlapping intervals? For example:

limits = [0, 25, 40, 60]
data = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18]

For the above lists, I want to find the number of elements in data that are within two adjacent limits. So for the above, the count would be something like:

0-25: 6; 25-40: 1; 40-60: 3;

All I can think of is O(n^2) in time. Is there a better way to do it?

Upvotes: 0

Views: 831

Answers (5)

Paddy3118
Paddy3118

Reputation: 4772

Doesn't need Counter to count as number of bins is known, swaps dict to array accesses for binning..

from bisect import bisect_right

def bin_it(limits, data):
    "Bin data according to (ascending) limits."
    bins = [0] * (len(limits) + 1)      # adds under/over range bins too

    for d in data:
        bins[bisect_right(limits, d)] += 1

    return bins

if __name__ == "__main__":
    limits = [0, 25, 40, 60]
    data = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18]

    bins = bin_it(limits, data)

    print(f"         < {limits[0]:2} :", bins[0])
    for lo, hi, count in zip(limits, limits[1:], bins[1:]):
        print(f">= {lo:2} .. < {hi:2} :", count)
    print(f">= {limits[-1]:2} ...     :", bins[-1])

"""
SAMPLE OUTPUT:

         <  0 : 0
>=  0 .. < 25 : 6
>= 25 .. < 40 : 1
>= 40 .. < 60 : 3
>= 60 ...     : 0

"""

Upvotes: 1

Alain T.
Alain T.

Reputation: 42133

You could use Counter (from collections) to manage the tallying and bisect to categorize:

from collections import Counter
from bisect import bisect_left

limits = [0, 25, 40, 60, 80]
data   = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18]

r = Counter(limits[bisect_left(limits,d)-1] for d in data)

print(r)
Counter({0: 6, 40: 3, 25: 1})

This has a time complexity of O(NLogM) where M is the number of limit breaks and N is the number of data items

Upvotes: 0

Manish D
Manish D

Reputation: 1

limits = [0, 25, 40, 60, 80]
data = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18, 25, 45, 85]
dict_data = {}
i = 0
count = 1
while i < len(limits)-1:
    for j in data:
        if j in range(limits[i], limits[i+1]):
            if '{}-{}'.format(limits[i],limits[i+1]) in dict_data:
                dict_data['{}-{}'.format(limits[i],limits[i+1])] +=count
            else:
                dict_data['{}-{}'.format(limits[i],limits[i+1])] = count
        
    i+=1
print(dict_data)

Upvotes: 0

Abhinav Mathur
Abhinav Mathur

Reputation: 8101

Here is a simple O(NlogN) approach. Sort your data, then use a two pointer approach to place each element in the correct interval.

limits = [0, 25, 40, 60]
data = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18]
data.sort()

n,m = len(data), len(limits)
count = [0]*(m-1)
# count[i] represents count between limits[i] and limits[i+1]

low = 0  # lower index of interval we are currently checking
ptr = 0

while ptr < n:
    i = data[ptr]
    if i >= limits[low] and i <= limits[low+1]:
        count[low] += 1
        ptr += 1
    elif i>=limits[low]:
        if low == len(limits)-1:
            break
        low += 1

print(count)

Upvotes: 0

Mahrad Hanaforoosh
Mahrad Hanaforoosh

Reputation: 546

I recommend you this approach which implements what you want in order of O(nlogn)

limits = [0, 25, 40, 60] # m
data = [15, 5, 2, 56, 45, 23, 6, 59, 33, 18] # n
data += limits # O(n+m)
data.sort() # O((n+m)log(n+m)) = O(nlogn)

result=dict() # O(1)
cnt = 0 # O(1)
aux ='' # O(1)
i = 0 # O(1)
for el in data: # O(n+m)
  if el == limits[i]:
    i+=1
    if cnt > 0:
      aux+='-'+str(el)
      result[aux] = cnt # average = O(1)
      cnt = 0
      aux = str(el)
    else:
      aux = str(el)
  else:
    cnt+=1

print(result)
# {'0-25': 6, '25-40': 1, '40-60': 3}

I showed the time complexity of each important line to calculate the total time complexity of the code. the total time complexity of the code is equal to O((n+m)log(n+m)) which can be shown as O(nlogn).

Improvement

you can improve it if you have some assumptions about the inputs. if you have info about the range of limits and data, then you can change the sorting algorithm to counting sort. the time complexity of counting sort is considered as O(n) and the total time complexity of code would be O(n)

Upvotes: 0

Related Questions