Kyle Macy

Reputation: 93

Fast method to find indexes of duplicates in a list of >2,000,000 items

I have a list where each item is a combination of two event ids: (This is just a snippet of the much larger list of pairs)

['10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', '10066013 10218005', '10223905 10224013']

I need to find every instance of each pair of ids and record its indexes in a new list. Right now I have a few lines of code that do this for me. However, my list is more than 2,000,000 lines long and will get much bigger as I process more data.

At this moment, the estimated time of completion is about 2 days.

I really just need a much faster method for this.

I'm working in Jupyter Notebooks (on a Mac Laptop)

def compiler(idlist):
    groups = []
    for i in idlist:
        # O(n^2): rescans the entire list once per element
        groups.append([index for index, x in enumerate(idlist) if x == i])
    return groups

I have also tried:

def compiler(idlist):
    groups = []
    for k, i in enumerate(idlist):
        position = []
        # O(n^2): inner loop scans the whole list for every element
        for c, j in enumerate(idlist):
            if i == j:
                position.append(c)
        groups.append(position)
    return groups

What I want is something like this:

'10000381 10007121': [0]
'10000381 10008989': [1]
'10005169 10008989': [2, 384775, 864173, 1297105, 1321798, 1555094, 1611064, 2078015]
'10008989 10023817': [3, 1321800]
'10005169 10043265': [4, 29113, 864195, 1297106, 1611081]
[5, 864196, 2078017]
'10008989 10043265': [6, 29114, 384777, 864198, 1611085, 1840733, 2078019]
'10023817 10043265': [7, 86626, 384780, 504434, 792690, 864215, 1297108, 1321801, 1489784, 1524527, 1555096, 1595763, 1611098, 1840734, 1841280, 1929457, 1943701, 1983362, 2093820, 2139917, 2168437] etc. etc. etc.

Where each number in the brackets is an index of that pair in the idlist.

Essentially, I want it to take a pair of id values (e.g. '10000381 10007121'), run through the list, find each instance of that pair, and record every index at which the pair occurs. I need something that does this for every single item in the list, in a shorter amount of time.

Upvotes: 4

Views: 262

Answers (3)

Chiheb Nexus

Reputation: 9267

If you have a lot of data, I would suggest using PyPy3 instead of the CPython interpreter; you'll get roughly 5x-7x faster code execution.

Here is a time-based benchmark comparing several implementations under CPython and PyPy3 with 10,000 iterations:

Code:

from time import time
from collections import OrderedDict, defaultdict


def timeit(func, iteration=10000):
    def wraps(*args, **kwargs):
        start = time()
        for _ in range(iteration):
            result = func(*args, **kwargs)
        end = time()
        print("func: {name} [{iteration} iterations] took: {elapsed:2.4f} sec".format(
            name=func.__name__,
            iteration=iteration,
            elapsed=(end - start)
        ))
        return result
    return wraps


@timeit
def op_implementation(data):
    groups = []
    for k in data:
        groups.append([index for index, x in enumerate(data) if x == k])
    return groups


@timeit
def ordreddict_implementation(data):
    groups = OrderedDict()
    for k, v in enumerate(data):
        groups.setdefault(v, []).append(k)
    return groups


@timeit
def defaultdict_implementation(data):
    groups = defaultdict(list)
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        groups[v].append(k)
    return groups


@timeit
def defaultdict_implementation_2(data):
    groups = defaultdict(list)
    for k, v in enumerate(map(lambda x: tuple(x.split()), data)):
        groups[v].append(k)
    return groups


@timeit
def dict_implementation(data):
    groups = {}
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        if v in groups:
            groups[v].append(k)
        else:
            groups[v] = [k]
    return groups



if __name__ == '__main__':
    data = [
        '10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', 
        '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', 
        '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', 
        '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', 
        '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', 
        '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', 
        '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', 
        '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', 
        '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', 
        '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', 
        '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', 
        '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', 
        '10066013 10218005', '10223905 10224013'
    ]
    op_implementation(data)
    ordreddict_implementation(data)
    defaultdict_implementation(data)
    defaultdict_implementation_2(data)
    dict_implementation(data)

CPython:

func: op_implementation [10000 iterations] took: 1.3096 sec
func: ordreddict_implementation [10000 iterations] took: 0.1866 sec
func: defaultdict_implementation [10000 iterations] took: 0.3311 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.3817 sec
func: dict_implementation [10000 iterations] took: 0.3231 sec

Pypy3:

func: op_implementation [10000 iterations] took: 0.2370 sec
func: ordreddict_implementation [10000 iterations] took: 0.0243 sec
func: defaultdict_implementation [10000 iterations] took: 0.1216 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.1299 sec
func: dict_implementation [10000 iterations] took: 0.1175 sec

PyPy3 with 200,000 iterations:

func: op_implementation [200000 iterations] took: 4.6364 sec
func: ordreddict_implementation [200000 iterations] took: 0.3201 sec
func: defaultdict_implementation [200000 iterations] took: 2.2032 sec
func: defaultdict_implementation_2 [200000 iterations] took: 2.4052 sec
func: dict_implementation [200000 iterations] took: 2.2429 sec
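Note that the implementations above do not all compute the same thing: ordreddict_implementation groups the whole pair strings (which is what the question asks for), while defaultdict_implementation and dict_implementation split each pair into individual ids before grouping. A minimal sketch of the difference, using two made-up pairs from the question's sample:

```python
from collections import OrderedDict, defaultdict

data = ['10000381 10007121', '10000381 10008989']

# Grouping whole pair strings (what ordreddict_implementation does):
pairs = OrderedDict()
for k, v in enumerate(data):
    pairs.setdefault(v, []).append(k)
print(dict(pairs))
# {'10000381 10007121': [0], '10000381 10008989': [1]}

# Splitting pairs into individual ids first (what defaultdict_implementation does);
# the recorded indexes refer to the flattened id list, not the pair list:
ids = defaultdict(list)
for k, v in enumerate([x for elm in data for x in elm.split()]):
    ids[v].append(k)
print(dict(ids))
# {'10000381': [0, 2], '10007121': [1], '10008989': [3]}
```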

Upvotes: 0

a_guest

Reputation: 36339

You can use a collections.OrderedDict in order to reduce the time complexity to O(n). Since it remembers insertion order, each pair's index list comes out in the order the pair first occurs:

from collections import OrderedDict

groups = OrderedDict()
for i, v in enumerate(idlist):
    try:
        groups[v].append(i)
    except KeyError:
        groups[v] = [i]

Then list(groups.values()) contains your final result.
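As a quick sanity check, here is the same loop run on a tiny made-up `idlist` containing duplicate pairs:

```python
from collections import OrderedDict

# Made-up list with two repeated pairs, for illustration only
idlist = ['10000381 10007121', '10005169 10008989',
          '10000381 10007121', '10005169 10008989']

groups = OrderedDict()
for i, v in enumerate(idlist):
    try:
        groups[v].append(i)
    except KeyError:
        groups[v] = [i]

print(dict(groups))
# {'10000381 10007121': [0, 2], '10005169 10008989': [1, 3]}
```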

Upvotes: 2

Quang Hoang

Reputation: 150825

Instead of a list, use a dict, which makes the existence lookup O(1) on average:

def compiler(idlist):
    groups = {}
    for idx, val in enumerate(idlist):
        if val in groups:
            groups[val].append(idx)
        else:
            groups[val] = [idx]
    return groups
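For example, on a tiny made-up list (the pair strings here are placeholders, not from the question's data):

```python
def compiler(idlist):
    # Map each pair string to the list of indexes where it occurs
    groups = {}
    for idx, val in enumerate(idlist):
        if val in groups:
            groups[val].append(idx)
        else:
            groups[val] = [idx]
    return groups

sample = ['a b', 'c d', 'a b']
print(compiler(sample))
# {'a b': [0, 2], 'c d': [1]}
```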

Upvotes: 2
