Reputation: 3386

How to quickly find first multiple of 2 of list element in list of large integers?

I have a list of approximately 100 000 sorted even integers in the range 10^12 to 10^14. My goal is to find the first integer x in the list such that x*2 is also a member of the list. As my list is rather long speed is very important.

My first thought was to just iterate over the list and check whether each element when multiplied by 2 was also in the list, but upon implementing this it is clear this is too slow for my purposes.

My next thought was to decompose each element in the list into its prime decomposition with SymPy's factorint, and then search my decomposed list for the same decomposition except with an extra 2. This didn't turn out to be any quicker obviously, but I feel like there must be a way using prime decomposition, if not something else.

Upvotes: 6

Answers (5)

dstromberg

Reputation: 7187

It seems to me the best algorithm would use a presorted list and a set.

#!/usr/local/pypy3-2.4.0/bin/pypy3

import time
import random


def get_random_numbers():
    result = []
    for counter in range(99998):
        random_number = (random.randrange(10**12, 10**14) // 2) * 2
        result.append(random_number)
    # Make sure there's at least one solution to the problem
    moderately_large = 10**14 // 2
    result.append(moderately_large)
    result.append(moderately_large * 2)
    # This sort is of course O(n*logn), but I don't think you're
    # counting that, are you?
    result.sort()
    return result


def get_best(list_):
    # This is O(n)
    set_ = set(list_)
    for value in list_:
        if value * 2 in set_:
            return value
    return None


def main():
    list_ = get_random_numbers()
    time0 = time.time()
    result = get_best(list_)
    time1 = time.time()
    print(result)
    print('{} seconds'.format(time1 - time0))


main()

HTH

Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210912

I will add yet another answer, because i've already overloaded my previous one...

Now I came up with a new idea - intersection of SortedSets, which works extremely fast comparing to modified Colin's and to my previous solution. The idea is to generate two SortedSets:

l2 - is an intersection of two lists/sets: the original one l and the one containing all elements from l multiplied by 2: SortedSet(x*2 for x in l).

l1 - is a SortedSet containing all elements belonging to l2, divided by 2: SortedSet(x//2 for x in l2)

Code:

from datetime import datetime as dt
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def test_Colin_Pitrat2():
    j = 0
    s = len(l)
    for i in range(0, s):
        l_i = l[i]
        l_i2 = l_i<<1
        while l[j] < l_i2:
            j += 1
            if j == s:
                return -1
        if l[j] == l_i2:
            return l[i]

def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            return i
    return -1

def test_SetIntersection():
    for i in l1:
        if i*2 in l2:
            return i
    return -1

if __name__=='__main__':
    start_ts = dt.now()
    timeit_number = 10000
    l =  load2list()
    print('load2list() took:\t%d microseconds' %((dt.now() - start_ts).microseconds))

    start_ts = dt.now()
    sorted_set = SortedSet(l)
    l2 = SortedSet(l).intersection(SortedSet(x*2 for x in l))
    l1 = SortedSet(x//2 for x in l2)
    print('preparing sets took:\t%d microseconds' %((dt.now() - start_ts).microseconds))

    print('len(l):\t\t%d' % (len(l)))
    print('len(l1):\t%d' % (len(l1)))
    print('len(l2):\t%d' % (len(l2)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))

    print('Timeit results for %d executions:' %timeit_number)
    print('Colin_Pitrat2:\t\t', Timer(test_Colin_Pitrat2).timeit(timeit_number))
    print('SortedSet:\t\t', Timer(test_SortedSet).timeit(timeit_number))
    print('SetIntersection:\t', Timer(test_SetIntersection).timeit(timeit_number))

Output:

load2list() took:       230023 microseconds
preparing sets took:    58106 microseconds
len(l):         498786
len(l1):        256
len(l2):        256
len(sorted_set):        498562
Timeit results for 10000 executions:
Colin_Pitrat2:           23.557948959065648
SortedSet:               6.658937808213555
SetIntersection:         0.012540539222982261

PS i really like this question, because it gave me a chance to learn something new. And I also like the Colin's algorithm - it's smart. I've already upvoted it.

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210912

The Colin Pitrat's solution is very good, but i guess you can top it when using sortedcontainers.SortedSet

I have generated a file with 1.000.000 random numbers (using numpy.random.randint) and checked that there is at least one number (in the middle of the file), which satisfies your condition.

Let's compare both approaches:

import filecmp
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def save_result(i, fn):
    with open(fn, 'a') as f:
        print(i, file=f)

def test_Colin_Pitrat():
    j = 0
    for i in range(0, len(l)):
        while l[j] < 2*l[i]:
            j += 1
            if j == len(l):
                return -1
        if l[j] == 2*l[i]:
            save_result(l[i], 'Colin_Pitrat.out')
            return l[i]


def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            save_result(i, 'SortedSet.out')
            return i
    return -1


if __name__=='__main__':
    timeit_number = 10000
    l = load2list()
    sorted_set = SortedSet(l)
    print('len(l):\t\t%d' % (len(l)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))
    print('Timeit results with %d executions:' %timeit_number)
    print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number))
    print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number))
    print("filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'):\t%s" % (filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out')))

Output:

len(l):         1000001
len(sorted_set):        999504
Timeit results with 10000 executions:
Colin_Pitrat:    35.94529931032006
SortedSet:       2.548847197918647
filecmp.cmp('Colin_Pitrat.out', 'SortedSet.out'):       True

PS as you can see SortedSet is very fast.

UPDATE: (now i'm testing at home, where my PC is much slower, so i will reduce the number of executions to 1.000)

As Colin Pitrat proposed i'm generating now data (approx. 100.000 numbers) for the worst scenario - when no match can be found. Now i will compare three functions: test_Colin_Pitrat, test_Colin_Pitrat2 (tuned version), test_SortedSet...

Data generator script:

import numpy as np

l = np.random.randint(10**7, 10**9, 200000)
l = l[ l % 2 > 0 ]
np.savetxt('data', np.sort(l), fmt='%d')

Code:

import filecmp
from sortedcontainers import SortedSet
from timeit import Timer

def load2list():
    with open('data', 'r') as f:
        return [int(line) for line in f]

def save_result(i, fn):
    with open(fn, 'a') as f:
        print(i, file=f)

def test_Colin_Pitrat():
    j = 0
    for i in range(0, len(l)):
        while l[j] < 2*l[i]:
            j += 1
            if j == len(l):
                return -1
        if l[j] == 2*l[i]:
            return l[i]

def test_Colin_Pitrat2():
    j = 0
    s = len(l)
    for i in range(0, s):
        l_i = l[i]
        l_i2 = l_i<<1
        while l[j] < l_i2:
            j += 1
            if j == s:
                return -1
        if l[j] == l_i2:
            return l[j]


def test_SortedSet():
    for i in sorted_set:
        if i<<1 in sorted_set:
            return i
    return -1


if __name__=='__main__':
    timeit_number = 1000
    l = load2list()
    sorted_set = SortedSet(l)
    print('len(l):\t\t%d' % (len(l)))
    print('len(sorted_set):\t%d' % (len(sorted_set)))
    print('Timeit results for %d executions:' %timeit_number)
    print('Colin_Pitrat:\t', Timer(test_Colin_Pitrat).timeit(timeit_number))
    print('Colin_Pitrat2:\t', Timer(test_Colin_Pitrat2).timeit(timeit_number))
    print('SortedSet:\t', Timer(test_SortedSet).timeit(timeit_number))

Output:

len(l):         99909
len(sorted_set):        99899
Timeit results for 1000 executions:
Colin_Pitrat:    153.04753258882357
Colin_Pitrat2:   103.68264272815443
SortedSet:       99.59669211136577

Conclusion: Colin_Pitrat2 is 33% faster compared to Colin_Pitrat and almost as fast as SortedSet.

Upvotes: 3

Colin Pitrat

Reputation: 2092

You can iterate on your list with two iterators: one pointing to the current element and one pointing to the first one greater or equal to it's double. This will be O(N).

Here is a draft of the idea:

l = [1, 3, 5, 7, 10, 12, 15]

# ...
j = 0
for i in range(0, len(l)):
    while l[j] < 2*l[i]:
        j += 1
        if j == len(l):
            return -1
    if l[j] == 2*l[i]:
        return i

Edit: Following comments in another answer, a few performances tests show that this version will be much faster (3 times in my tests) by eliminating multiplications, calls to len and reducing number of item retrievals in the list:

j = 0
s = len(l)
for i in range(0, s):
    l_i = l[i]
    l_i2 = l_i<<1
    while l[j] < l_i2:
        j += 1
        if j == s:
            return -1
    if l[j] == l_i2:
        return i

Upvotes: 7

Mario Freitas

Reputation: 161

You can divide your list in two so that the lower list contains all numbers from the first one on the original list to the last number that is equal to or smaller than half the maximum value on your list. That will speed up your search a little bit as you will iterate over only half of the list and search if x*2 is on the other half. See example bellow

l = [1, 3, 5, 7, 10, 12, 15]

max_l = l[-1]

for i in range(len(l)):
  if l[i] > max_l/2:
    break
  pos_mid_l = i

print pos_mid_l # returns position 3

lower_l = l[:pos_mid_l+1]
upper_l = l[pos_mid_l+1:]

print lower_l   # returns lower half of your list
print upper_l   # returns upper half of your list

for i in  lower_l:
  if i < upper_l[0]/2:
    if i*2 in lower_l:
      ans = i
      break
  else:
    if i*2 in upper_l:
      ans = i
      break

print ans # Your answer :)

Upvotes: 0

How to quickly find first multiple of 2 of list element in list of large integers?

Answers (5)

Related Questions