Reputation: 6045
I have an input of about 2-5 million strings of about 400 characters each, coming from a stored text file. I need to check for duplicates before adding each one to the collection I check against (it doesn't have to be a list; it can be any other data type, and since all items end up unique it is effectively a set).
I can expect at most about 0.01% of my data to be non-unique, and I need to filter those out.
I'm wondering if there is any faster way to check whether an item already exists in the list than:
a = []
for item in data:
    if item not in a:
        a.append(item)
I do not want to lose the order.
Would hashing be faster (I don't need cryptographic hashing)? But then I'd have to maintain a hash table of all the values to check against first. Is there an approach I'm missing?
I'm on Python 2 and can go up to Python 3.5 at most.
Upvotes: 0
Views: 2322
Reputation: 70602
It's hard to answer this question because it keeps changing ;-) The version I'm answering asks whether there's a faster way than:
a = []
for item in data:
    if item not in a:
        a.append(item)
That will be horridly slow, taking time quadratic in len(data), because each item not in a test scans the whole list. In any version of Python, the following will instead take expected-case time linear in len(data):
seen = set()
for item in data:
    if item not in seen:
        seen.add(item)
        emit(item)
where emit() does whatever you like (append to a list, write to a file, whatever).
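For instance, one natural choice for emit() is yield, which turns the loop into a generator; this is just a sketch of that variant, with made-up sample data:

def unique_in_order(data):
    # Yield each item the first time it is seen, keeping the original order.
    seen = set()
    for item in data:
        if item not in seen:
            seen.add(item)
            yield item  # this plays the role of emit()

unique = list(unique_in_order(["b", "a", "b", "c", "a"]))
# unique == ["b", "a", "c"]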
In comments I already noted ways to achieve the same thing with ordered dictionaries (whether ordered by language guarantee in Python 3.7, or via the OrderedDict type from the collections package). The code just above is the most memory-efficient, though.
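For reference, a minimal sketch of that ordered-dictionary variant: OrderedDict.fromkeys keeps only the first occurrence of each key, in insertion order, so it deduplicates while preserving order. It works on Python 2.7 and 3.x, but it holds a dict over all items, so it uses more memory than the set-based loop above.

from collections import OrderedDict

# Keys are unique and remember insertion order, so later duplicates are dropped.
unique = list(OrderedDict.fromkeys(data))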
Upvotes: 3
Reputation: 87
You can try this:
a = list(set(data))
A list is an ordered sequence of elements, whereas a set is a collection of distinct elements with no defined order.
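A quick illustration with made-up sample data: the duplicates are removed, but because sets are unordered the original order of the strings is not preserved:

data = ["b", "a", "b", "c", "a"]
a = list(set(data))
# a contains "a", "b" and "c" exactly once each, in an arbitrary order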
Upvotes: -1