Reputation: 392
I'm trying to remove duplicate items from a large text file containing 250 million items (about 4.4 GB).
I was impressed to see that I could load this file into a Python list in just a few minutes with the following code:
x = []
with open("online.txt") as file:
    for l in file:
        x.append(l)
print('count of array: ')
print(len(x))
But when I tried to simply check that the next item doesn't already exist before adding it to the list, it's taking many hours to finish. I feel like I'm missing something simple that would really speed this up.
Here's the code I used to check for duplicate items:
a = []
x = []
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.append(l.strip())
            a.append(l)
print('with duplicates: ')
print(len(a))
print('without duplicates: ')
print(len(x))
This is running on a server with 64 GB of RAM and modern dual Xeon processors.
Upvotes: 1
Views: 70
Reputation: 392
In the end, here's the code I used to remove duplicates:
x = set()
with open("all.txt") as file:
    for l in file:
        x.add(l)
print('count of array: ')
print(len(x))
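If the deduplicated items also need to go back to disk, a minimal follow-up like this works (the output filename here is just an example, not from the original post):

with open("deduped.txt", "w") as out:
    # each item still ends with its newline from the original file,
    # so writelines() needs no extra separators; sorted() is optional
    out.writelines(sorted(x))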
Upvotes: 0
Reputation: 14906
The problem is that with a plain list, Python has to search through every existing entry each time before adding a new one.
You could try a Python dictionary or a set instead of a list. Both give roughly constant-time membership tests, so checking whether an entry already exists is much faster.
Simply change your code:
a = set()  # a set, not {} (which creates an empty dict)
x = set()
with open("online.txt") as file:
    for l in file:
        if l in a:
            print('duplicate')
            print(l)
        else:
            x.add(l.strip())  # add the stripped line to the result set
            a.add(l)
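To see the difference concretely, here is a rough, self-contained timing sketch (the sizes are made up for illustration, not taken from your data):

import timeit

items = [str(i) + '\n' for i in range(100000)]
as_list = list(items)
as_set = set(items)
needle = items[-1]  # worst case for the list: the element is at the very end

# a list membership test scans every element; a set test hashes the value
print(timeit.timeit(lambda: needle in as_list, number=1000))
print(timeit.timeit(lambda: needle in as_set, number=1000))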
You don't specify your input file format, but there may also be a speed increase from loading the whole data set into one giant string and splitting it with Python's string functions, rather than looping over it manually as you do here.
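For example, something like this (a rough sketch, assuming the file is newline-delimited and fits comfortably in the 64 GB of RAM you mention) does the splitting in C code instead of a Python loop:

with open("online.txt") as file:
    data = file.read()        # pull the whole 4.4 GB file into one string
lines = data.splitlines()     # split into items without a manual Python loop
unique = set(lines)           # deduplicate via hashing
print(len(unique))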
Upvotes: 1