Lucas Kauffman

Reputation: 6891

Efficiently keeping a list to prevent duplicates in Python

I'm having some trouble with a CSV data source that contains duplicate IDs. The final result, however, should contain each ID only once, so we decided to process only the first instance of each ID and ignore any later ones.

Currently my code looks a bit like this:

id_list = list()
for item in datasource:
    if item[0] not in id_list:
        # process the first occurrence of this ID
        id_list.append(item[0])

The problem is that as the list grows, performance drops, because each `not in` check scans the whole list. Is there a more efficient way of tracking the already-processed IDs?

Upvotes: 0

Views: 319

Answers (3)

thefourtheye

Reputation: 239473

Use a set object: sets are guaranteed not to contain duplicates and provide fast membership testing. You can use a set like this:

id_list = set()
for item in datasource:
    if item[0] not in id_list:
        # process the first occurrence of this ID
        id_list.add(item[0])

This will be faster, since membership lookup in a set happens in constant time on average, as opposed to the linear-time lookup in a list.
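For example, applied to a small CSV source (the sample rows here are hypothetical, and `io.StringIO` stands in for a real file):

```python
import csv
import io

# Hypothetical sample data: three rows, two sharing the ID "1001".
raw = "1001,alice\n1002,bob\n1001,alice-dup\n"

seen = set()
unique_rows = []
for row in csv.reader(io.StringIO(raw)):
    if row[0] not in seen:          # O(1) average membership test
        seen.add(row[0])
        unique_rows.append(row)     # "process": keep first occurrence only

print(unique_rows)  # [['1001', 'alice'], ['1002', 'bob']]
```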

Upvotes: 5

Stefan

Reputation: 2518

With reference to this question, I would suggest using a dict.

In particular, the case of unique keys seems appropriate here, since a dict can also store the first row seen for each ID.

You could then try something like:

seen = {}
for item in datasource:
    if item[0] not in seen:
        seen[item[0]] = item  # insert first occurrence only

Upvotes: 1

d3dave

Reputation: 1381

Instead of using a list you can use a binary search tree ordered by the ID, which gives O(log n) membership tests on a balanced tree instead of the list's O(n).
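A minimal sketch of that idea, assuming string IDs compared lexicographically (the `datasource` rows below are hypothetical sample data, and the tree is deliberately simple and unbalanced):

```python
class BSTNode:
    """Node of a simple (unbalanced) binary search tree keyed by ID."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert_if_absent(root, key):
    """Insert key into the tree rooted at root.

    Returns (root, True) if the key was new, (root, False) if already present.
    """
    if root is None:
        return BSTNode(key), True
    node = root
    while True:
        if key == node.key:
            return root, False
        branch = "left" if key < node.key else "right"
        child = getattr(node, branch)
        if child is None:
            setattr(node, branch, BSTNode(key))
            return root, True
        node = child

# Hypothetical sample rows; the ID "1001" appears twice.
datasource = [("1001", "alice"), ("1002", "bob"), ("1001", "dup")]

root, processed = None, []
for item in datasource:
    root, is_new = insert_if_absent(root, item[0])
    if is_new:
        processed.append(item)  # process first occurrence only
```

Note that an unbalanced tree degrades to O(n) per lookup on already-sorted IDs, which is one reason the set/dict answers are usually preferred in Python.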

Upvotes: 0
