Reputation: 2366
I have a sorted list of numbers (10 million of them) in character format, and each entry is a constant length of 15 characters. Something like:
100000000000000
100000000000001
...
100000010000000
Now I want to create a breakdown of this list, to see how the entries accumulate in different ranges. The output could look like this:
100000000xxxxxx, 523121 entries
100000001xxxxxx, 32231 entries
Currently I have tried reading the whole list into a set and searching through it. I have tried both string and int formats. The integer version is currently about 3 times faster than the string version. The code looks like this:
import sys

collection_str = set(line.strip() for line in open(sys.argv[1]))
collection_int = set(int(line.strip()) for line in open(sys.argv[1]))

def find_str(look_for, ourset):
    count = 0
    for entry in ourset:
        if entry.startswith(look_for):
            count += 1
    return count

def find_int(look_for, ourset):
    search_min = int(str(look_for) + "000000")
    search_max = int(str(look_for + 1) + "000000")
    count = 0
    for entry in ourset:
        if search_min <= entry < search_max:
            count += 1
    return count
The results look like this:
"int version"
100000100 27401 (0.515992sec)
100000101 0 (0.511334sec)
100000102 0 (0.510956sec)
100000103 0 (0.510467sec)
100000104 0 (0.512834sec)
100000105 0 (0.511501sec)
"string version"
100000100 27401 (1.794804sec)
100000101 0 (1.794449sec)
100000102 0 (1.802035sec)
100000103 0 (1.797590sec)
100000104 0 (1.793691sec)
100000105 0 (1.796785sec)
I wonder if I could somehow make it even faster? Even at 0.5 s per range, this still takes time if I want to run it often to produce periodic statistics...
From searching around I see that some people use bisect for something similar, but I can't seem to get my head around how it should work.
Upvotes: 1
Views: 198
Reputation: 23753
If the list is sorted, bisect will find the index that meets your criteria using a bisection search. It looks like bisect is much faster than using a numpy array.
import numpy as np
import bisect
from random import randint
from timeit import Timer

# Build 10 million sorted 15-character entries like the OP's input
ip = ['1{0:014d}'.format(randint(0, 10000000)) for x in xrange(10000000)]
ip = sorted(ip)

# bisect on the sorted string list
print bisect.bisect(ip, '100000000010000')
# 9869
t = Timer("bisect.bisect(ip, '100000000010000')", 'from __main__ import bisect, ip')
print t.timeit(100)
# 0.000268309933485 seconds

# bisect on the equivalent int list is faster still
ip_int = map(int, ip)
print bisect.bisect(ip_int, 100000000010000)
# 9869
t = Timer("bisect.bisect(ip_int, 100000000010000)", 'from __main__ import bisect, ip_int')
print t.timeit(100)
# 0.000137443078672 seconds

# a vectorised comparison over a numpy array scans every element
ip_numpy = np.array(ip_int)
print np.sum(ip_numpy <= 100000000010000)
# 9869
t = Timer("np.sum(ip_numpy <= 100000000010000)", 'from __main__ import np, ip_numpy')
print t.timeit(100)
# 8.23690123071 seconds
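For your actual task, counting the entries in a prefix range is just the difference of two bisect_left calls. A minimal sketch (find_bisect is a hypothetical name; it assumes the sorted ip_int list from above and reuses your integer bounds):

import bisect

def find_bisect(look_for, sorted_list):
    # same range bounds as the OP's find_int
    search_min = int(str(look_for) + "000000")
    search_max = int(str(look_for + 1) + "000000")
    # two O(log n) binary searches instead of an O(n) scan over a set
    return (bisect.bisect_left(sorted_list, search_max) -
            bisect.bisect_left(sorted_list, search_min))

print find_bisect(100000100, ip_int)

Each call takes microseconds even on 10 million entries, so generating the whole periodic breakdown becomes cheap.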
Upvotes: 1
Reputation: 1048
Put it into a numpy array. Then you can use vectorisation, which is nice and fast :)
from random import randint
import numpy
ip = numpy.array(['1{0:014d}'.format(randint(0, 10000000)) for x in xrange(10000000)], dtype=numpy.int64)
numpy.sum(ip <= 100000000010000)
# 9960
%timeit numpy.sum(ip <= 100000000010000)
# 10 loops, best of 3: 35 ms per loop
Putting this in terms of your search functions:
import numpy

def find_numpy(look_for, ourset):
    # pad the prefix with zeros on the right to get 15-digit range bounds
    search_min = int('{0:0<15s}'.format(str(look_for)))
    search_max = int('{0:0<15s}'.format(str(look_for + 1)))
    return numpy.sum((ourset >= search_min) & (ourset < search_max))

with open('path/to/your/file.txt', 'r') as f:
    ip = numpy.array([line.strip() for line in f], dtype=numpy.int64)

find_numpy(1000000001, ip)
# 99686
%timeit find_numpy(1000000001, ip)
# 10 loops, best of 3: 86.6 ms per loop
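If you keep the array sorted (your file already is), numpy.searchsorted gives you the bisect approach in numpy form and avoids the full boolean-mask pass over the array. A sketch under that assumption (find_searchsorted is a hypothetical name):

import numpy

def find_searchsorted(look_for, sorted_arr):
    # same zero-padded 15-digit bounds as find_numpy
    search_min = int('{0:0<15s}'.format(str(look_for)))
    search_max = int('{0:0<15s}'.format(str(look_for + 1)))
    # binary search for both bounds in one call (side='left' is the default)
    lo, hi = numpy.searchsorted(sorted_arr, [search_min, search_max])
    return hi - lo

Unlike the mask in find_numpy, this does not touch every element, so it scales with log(n) rather than n.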
Upvotes: 2