azazelspeaks
azazelspeaks

Reputation: 6045

Python: Fastest way to search if long string is in list of strings

I have an input of about 2-5 millions strings of about 400 characters each, coming from a stored text file. I need to check for duplicates before adding them to the list that I check (doesn't have to be a list, can be any other data type, the list is technically a set since all items are unique).

I can expect about 0.01% at max of my data to be non-unique and I need to filter them out.

I'm wondering if there is any faster way for me to check if the item exists in the list rather than:

a=[]
for item in data:
    if item not in a:
        a.add(item)

I do not want to lose the order.

Would hashing be faster (I don't need encryption)? But then I'd have to maintain a hash table for all the values to check first. Is there any way I'm missing?

I'm on python 2, can at max go upto python 3.5.

Upvotes: 0

Views: 2322

Answers (2)

Tim Peters
Tim Peters

Reputation: 70602

It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:

a=[]
for item in data:
    if item not in a:
        a.add(item)

That will be horridly slow, taking time quadratic in len(data). In any version of Python the following will take expected-case time linear in len(data):

seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)

where emit() does whatever you like (append to a list, write to a file, whatever).

In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.

Upvotes: 3

samandal
samandal

Reputation: 87

You can try this,

a = list(set(data))

A List is an ordered sequence of elements whereas Set is a distinct list of elements which is unordered.

Upvotes: -1

Related Questions