Reputation: 5421
I am working on a postage application which is required to check an integer postcode against a number of postcode ranges, and return a different code based on which range the postcode matches against.
Each code has more than one postcode range. For example, the M code should be returned if the postcode is within the ranges 1000-2429, 2545-2575, 2640-2686 or is equal to 2890.
I could write this as:
if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 or 2640 <= postcode <= 2686 or postcode == 2890:
return 'M'
but this seems like a lot of lines of code, given that there are 27 returnable codes and 77 total ranges to check against. Is there a more efficient (and preferably more concise) method of matching an integer to all these ranges using Python?
Edit: There's a lot of excellent solutions flying around, so I have implemented all the ones that I could, and benchmarked their performances.
The environment for this program is a web service (Django-powered actually) which needs to check postcode region codes one-by-one, on the fly. My preferred implementation, then, would be one that can be quickly used for each request, and does not need any process to be kept in memory, or needs to process many postcodes in bulk.
I tested the following solutions using timeit.Timer
with default 1000000 repetitions using randomly generated postcodes each time.
if 1000 <= postcode <= 2249 or 2555 <= postcode <= 2574 or ...:
return 'M'
if 2250 <= postcode <= 2265 or ...:
return 'N'
...
Time for 1m reps: 5.11 seconds.
Somewhat more elegant to my mind and certainly easier to enter and read the ranges. Particularly good if they change over time, which is possible. But it did end up four times slower in my implementation.
if any(lower <= postcode <= upper for (lower, upper) in [(1000, 2249), (2555, 2574), ...]):
return 'M'
if any(lower <= postcode <= upper for (lower, upper) in [(2250, 2265), ...]):
return 'N'
...
Time for 1m reps: 19.61 seconds.
As stated by the author, "it's only better if you are building the set once to check against many postcodes in a loop". But I thought I would test it anyway to see.
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((1000, 2249), (2555, 2574), ...)))):
return 'M'
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((2250, 2265), ...)))):
return 'N'
...
Time for 1m reps: 339.35 seconds.
This one may have been a bit above my intellect level. I learnt a lot reading about the bisect
module but just couldn't quite work out which parameters to give find_ge()
to make a runnable test. I expect that it would be extremely fast with a loop of many postcodes, but not if it had to do the setup each time. So, with 1m repetitions of filling numbers
, edgepairs
, edgeanswers
etc for just one postal region code (the M code with four ranges), but not actually running the fast_solver
:
Time for 1m reps: 105.61 seconds.
Using one dict per postal region code pre-generated, cPickled in a source file (106 KB), and loaded for each run. I was expecting much better performance from this method, but on my system at least, the IO really destroyed it. The server is a not-quite-blindingly-fast-top-of-the-line Mac Mini.
Time for 1m reps: 5895.18 seconds (extrapolated from a 10,000 run).
Well, I was expecting someone to just give a simple 'duh' answer that I hadn't considered, but it turns out this is much more complicated (and even a little controversial).
If every nanosecond of efficiency counted in this case, I would probably keep a separate process running which implemented one of the binary search or dict solutions and kept the result in memory for an extremely fast lookup. However, since the IF tree takes only five seconds to run a million times, which is plenty fast enough for my small business, that's what I'll end up using in my application.
Thank you to everyone for contributing!
Upvotes: 46
Views: 36832
Reputation: 726
Another solution is based on looking at the data set limits.
When possible, you make a table(s) that encompasses the range of values reduced to a common denominator for quick lookup.
Sort of like a hash table.
Borrowing the idea from the tricks OS kernel developers do for quick address range checks on memory ranges that are usually in 4096 (page size) multiples.
Over the OPs original set of values: (1000,2429), (2545,2575), (2640,2686), (2890, 2890)
If you divide these by 100 you will get a set of buckets: (10 to 24), 25, 26, and 28
We generate two tables then:
ranges = ((1000,2429), (2545,2575), (2640,2686), (2890, 2890))
buckets = {10:0,11:0,12:0,13:0,14:0,15:0,16:0,17:0,18:0,19:0,20:0,21:0,22:0,23:0,24:0, 25:1, 26:2, 28:3}
Our test function:
def find_m(number: int):
# In a bucket?
index = buckets.get(number // 100)
if index != None:
# Yes, is it in range?
if ranges[index][0] <= number <= ranges[index][1]:
# Yes
return 'M'
return None
Lets test it:
print(find_m(1500))
print(find_m(2600))
print(find_m(2890))
print(find_m(1))
The (correct) output:
M
None
M
None
This might be an optimal solution speed-wise since we reduce the range check to just a divide, a dictionary lookup, and a verifying range check. About as fast as Python does it's internal native code dictionary hash table lookup.
Extra points:
Proverb: "know thy data"
Upvotes: 0
Reputation: 80
In Python 3.2 functools.lru_cache was introduced. Your solution along with the beforementioned decorator should be pretty fast. Or, Python 3.9's functools.cache could be used as well (which should be even faster).
Upvotes: 0
Reputation: 9740
A bit of a silly approach to an old question, but I was curious how well regex character classes would handle the problem, since this exact problem occurs frequently in questions about character validity.
To make a regex for the "M" postal codes you showed, we can turn the numbers into unicode using chr()
:
m_ranges = [(1000, 2249), (2545, 2575), (2640, 2686)]
m_singletons = [2890]
m_range_char_class_members = [fr"{chr(low)}-{chr(high)}" for (low, high) in m_ranges]
m_singleton_char_class_members = [fr"{chr(x)}" for x in m_singletons]
m_char_class = f"[{''.join(m_range_char_class_members + m_singleton_char_class_members)}]"
m_regex = re.compile(m_char_class)
Then a very rough benchmark on 1 million random postal codes for this method vs your original if-statement:
test_values = [random.randint(1000, 9999) for _ in range(1000000)]
def is_m_regex(num):
return m_regex.match(chr(num))
def is_m_if(num):
return 1000 <= num <= 2249 or 2545 <= num <= 2575 or 2640 <= num <= 2686 or num == 2890
def run_regex_test():
start_time = time.time()
for i in test_values:
is_m_regex(i)
print("--- REGEX: %s seconds ---" % (time.time() - start_time))
def run_if_test():
start_time = time.time()
for i in test_values:
is_m_if(i)
print("--- IF: %s seconds ---" % (time.time() - start_time))
...
running regex test
--- REGEX: 0.3418138027191162 seconds ---
--- IF: 0.19183707237243652 seconds ---
So this would suggest that for comparing one character at a time, using raw if statements is faster than character classes in regexes. No surprise here, since using regex is a bit silly for this problem.
BUT. When doing a an operation like sub
to eliminate all matches from a string composed of all the original test values, it ran much quicker:
blob_value = ''.join([chr(x) for x in test_values])
def run_regex_test_char_blob():
start_time = time.time()
subbed = m_regex.sub('', blob_value)
print("--- REGEX BLOB: %s seconds ---" % (time.time() - start_time))
print(f"original blob length : {len(blob_value)}")
print(f"sub length : {len(subbed)}")
...
--- REGEX BLOB: 0.03655815124511719 seconds ---
original blob length : 1000000
sub length : 851928
The sub
method here replaces all occurrences of M-postal-characters (~15% of this sample), which means it operated on all 1 million characters of the string. That would suggest to me that mass operations by the re
package are MUCH more efficient than individual operations suggested in these answers. So if you've really got a lot of comparisons to do at once in a data pipeline, you may find the best performance by doing some string composition and using regex.
Upvotes: 0
Reputation: 3328
The full data isn't there, but I'm assuming the ranges are non-overlapping, so you can express your ranges as a single sorted tuple of ranges, along with their codes:
ranges = (
(1000, 2249, 'M'),
(2250, 2265, 'N'),
(2555, 2574, 'M'),
# ...
)
This means we can binary search over them in one go. This should be O(log(N)) time, which should result in pretty decent performance with very large sets.
def code_lookup(value, ranges):
left, right = 0, len(ranges)
while left != right - 1:
mid = left + (right - left) // 2
if value <= ranges[mid - 1][1]: # Check left split max
right = mid
elif value >= ranges[mid][0]: # Check right split min
left = mid
else: # We are in a gap
return None
if ranges[left][0] <= value <= ranges[left][1]:
# Return the code
return ranges[left][2]
I don't have your exact values, but for comparison I ran it against some generated ranges (77 ranges with various codes) and compared it to a naive approach:
def get_code_naive(value):
if 1000 < value < 2249:
return 'M'
if 2250 < value < 2265:
return 'N'
# ...
The result for 1,000,000 was that the naive version ran in about 5 sec and the binary search version in 4 sec. So it's a bit faster (20%), the codes are a lot nicer to maintain and the longer the list gets, the more it will out-perform the naive method over time.
Upvotes: 5
Reputation: 649
Recently I had a similar requirement and I used bit manipulation to test if an integer belongs to said range. It is definitely faster, but I guess not suitable if your ranges involve huge numbers. I liberally copied example methods from here
First we create a binary number which will have all bits in the range set to 1.
#Sets the bits to one between lower and upper range
def setRange(permitRange, lower, upper):
# the range is inclusive of left & right edge. So add 1 upper limit
bUpper = 1 << (upper + 1)
bLower = 1 << lower
mask = bUpper - bLower
return (permitRange | mask)
#For my case the ranges also include single integers. So added method to set single bits
#Set individual bits to 1
def setBit(permitRange, number):
mask = 1 << vlan
return (permitRange| mask)
Now time to parse the range and populate our binary mask. If the highest number in the range is n, we will be creating integer greater than 2^n in binary
#Example range (10-20, 25, 30-50)
rangeList = "10-20, 25, 30-50"
maxRange = 100
permitRange = 1 << maxRange
for range in rangeList.split(","):
if range.isdigit():
permitRange = setBit(permitRange, int(range))
else:
lower, upper = range.split("-",1)
permitRange = setRange(permitRange, int(lower), int(upper))
return permitRange
To check if a number 'n' belongs to the range, simply test the bit at n'th position
#return a non-zero result, 2**offset, if the bit at 'offset' is one.
def testBit(permitRange, number):
mask = 1 << number
return (permitRange & mask)
if testBit(permitRange,10):
do_something()
Upvotes: 3
Reputation: 49126
Here is a fast and short solution, using numpy:
import numpy as np
lows = np.array([1, 10, 100]) # the lower bounds
ups = np.array([3, 15, 130]) # the upper bounds
def in_range(x):
return np.any((lows <= x) & (x <= ups))
Now for instance
in_range(2) # True
in_range(23) # False
Upvotes: 9
Reputation: 17173
you only have to solve for edge cases and for one number between edge cases when doing inequalities.
e.g. if you do the following tests on TEN:
10 < 20, 10 < 15, 10 > 8, 10 >12
It will give True True True False
but note that the closest numbers to 10 are 8 and 12
this means that 9,10,11 will give the answers that ten did.. if you don't have too many initial range numbers and they are sparse then this well help. Otherwise you will need to see if your inequalities are transitive and use a range tree or something.
So what you can do is sort all of your boundaries into intervals. e.g. if your inequalities had the numbers 12, 50, 192,999
you would get the following intervals that ALL have the same answer: less than 12, 12, 13-49, 50, 51-191, 192, 193-998, 999, 999+
as you can see from these intervals we only need to solve for 9 cases and we can then quickly solve for anything.
Here is an example of how I might carry it out for solving for a new number x using these pre-calculated results:
a) is x a boundary? (is it in the set) if yes, then return the answer you found for that boundary previously. otherwise use case b)
b) find the maximum boundary number that is smaller than x, call it maxS find the minimum boundary number that is larger than x call it minL. Now just return any previously found solution that was between maxS and minL.
see Python binary search-like function to find first number in sorted list greater than a specific value for finding closest numbers. bisect module will help (import it in your code) This will help finding maxS and minL
You can use bisect and the function i have included in my sample code:
def find_ge(a, key):
'''Find smallest item greater-than or equal to key.
Raise ValueError if no such item exists.
If multiple keys are equal, return the leftmost.
'''
i = bisect_left(a, key)
if i == len(a):
raise ValueError('No item found with key at or above: %r' % (key,))
return a[i]
ranges=[(1000,2429), (2545,2575), (2640,2686), (2890, 2890)]
numbers=[]
for pair in ranges:
numbers+=list(pair)
numbers+=[-999999,999999] #ensure nothing goes outside the range
numbers.sort()
edges=set(numbers)
edgepairs={}
for i in range(len(numbers)-1):
edgepairs[(numbers[i],numbers[i+1])]=(numbers[i+1]-numbers[i])//2
def slow_solver(x):
return #your answer for postcode x
listedges=list(edges)
edgeanswers=dict(zip(listedges,map(solver,listedges)))
edgepairsanswers=dict(zip(edgepairs.keys(),map(solver,edgepairs.values())))
#now we are ready for fast solving:
def fast_solver(x):
if x in edges:
return edgeanswers[x]
else:
#find minL and maxS using find_ge and your own similar find_le
return edgepairsanswers[(minL,maxS)]
Upvotes: 5
Reputation:
Warning - This is probably premature optimisation. For a large list of ranges it might be worthwhile, but probably not in your case. Also, although dictionary/set solutions will use more memory, they are still probably a better choice.
You could do a binary-search into your range end-points. This would be easy if all ranges are non-overlapping, but could still be done (with some tweaks) for overlapping ranges.
Do a find-highest-match-less-than binary search. This is the same as a find-lowest-match-greater-than-or-equal (lower bound) binary search, except that you subtract one from the result.
Use half-open items in your list of end points - that is if your range is 1000..2429 inclusive, use the values 1000 and 2430. If you get an end-point and a start-point with the same value (two ranges touching, so there is no gap between) exclude the end-point for the lower range from your list.
If you find a start-of-range end-point, your goal value is within that range. If you find an end-of-range end-point, your goal value isn't in any range.
The binary search algorithm is roughly (don't expect this to run without editing)...
while upperbound > lowerbound :
testpos = lowerbound + ((upperbound-lowerbound) // 2)
if item [testpos] > goal :
# new best-so-far
upperbound = testpos
else :
lowerbound = testpos + 1
Note - the "//" division operator is necessary for integer division in Python 3. In Python 2, the normal "/" will work, but it's best to be ready for Python 3.
At the end, both upperbound and lowerbound point to the found item - but for the "upper bound" search. Subtract one to get the required search result. If that gives -1, there is no matching range.
There's probably a binary search routine in the library that does the upper-bound search, so prefer that to this if so. To get a better understanding of how the binary search works, see How can I better understand the one-comparison-per-iteration binary search? - no, I'm not above begging for upvotes ;-)
Upvotes: 1
Reputation: 304137
Probably the fastest will be to check the membership of a set
>>> from itertools import chain
>>> ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
>>> postcodes = set(chain(*(xrange(start, end+1) for start, end in ranges)))
>>> 1000 in postcodes
True
>>> 2500 in postcodes
False
But it does use more memory this way, and building the set takes time, so it's only better if you are building the set once to check against many postcodes in a loop
EDIT: seems that different ranges need to map to different letters
>>> from itertools import chain
>>> ranges = {'M':((1000,2429), (2545,2575), (2640,2686), (2890, 2890)),
# more ranges
}
>>> postcodemap = dict((k,v) for v in ranges for k in chain(*imap(xrange, *zip(*ranges[v]))))
>>> print postcodemap.get(1000)
M
>>> print postcodemap.get(2500)
None
Upvotes: 8
Reputation: 36134
Python has a range(a, b) function which means the range from (and including) a, to (but excluding) b. You can make a list of these ranges and check to see if a number is in any of them. It may be more efficient to use xrange(a, b) which has the same meaning but doesn't actually make a list in memory.
list_of_ranges = []
list_of_ranges.append(xrange(1000, 2430))
list_of_ranges.append(xrange(2545, 2576))
for x in [999, 1000, 2429, 2430, 2544, 2545]:
result = False
for r in list_of_ranges:
if x in r:
result = True
break
print x, result
Upvotes: 1
Reputation: 134811
You can throw your ranges into tuples and put the tuples in a list. Then use any()
to help you find if your value is within these ranges.
ranges = [(1000,2429), (2545,2575), (2640,2686), (2890, 2890)]
if any(lower <= postcode <= upper for (lower, upper) in ranges):
print('M')
Upvotes: 19