Reputation: 6891
I'm having some trouble with a CSV data source that contains some duplicate IDs. The final result, however, should contain each ID only once, so it was decided that we should only process the first instance we see and ignore any later instances.
Currently my code looks a bit like this:
id_list = list()
for item in datasource:
    if item[0] not in id_list:
        # process
        id_list.append(item[0])
The problem is that performance drops as the list grows. Are there more efficient ways of tracking the already-processed IDs?
Upvotes: 0
Views: 319
Reputation: 239473
Use a set object; sets are guaranteed not to have duplicates and provide fast membership testing. You can use a set like this:
id_list = set()
for item in datasource:
    if item[0] not in id_list:
        # process
        id_list.add(item[0])
This will perform better, since membership lookups on set objects happen in constant time on average, as opposed to the linear-time lookups in lists.
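A quick way to see the difference is to time a membership test on both containers. Here is a minimal sketch (the container size and the missing key are arbitrary, chosen just for illustration):

import timeit

setup = "ids = range(100000); as_list = list(ids); as_set = set(ids)"
# Searching for a key that is absent forces the list to scan every element
print(timeit.timeit("100001 in as_list", setup=setup, number=100))
print(timeit.timeit("100001 in as_set", setup=setup, number=100))

The list version gets slower with every element you add, while the set version stays essentially flat.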
Upvotes: 5
Reputation: 2518
With reference to this question, I would suggest using a dict. In particular, the fact that dict keys must be unique seems appropriate here. You could then try something like:
seen = {}
for item in datasource:
    if item[0] not in seen:
        seen[item[0]] = item  # process the row, then record it under its ID
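As a side note, dicts preserve insertion order in Python 3.7+, so seen ends up holding the first-seen row for every ID, in order. A minimal sketch with made-up sample rows:

datasource = [("a", 1), ("b", 2), ("a", 3)]
seen = {}
for item in datasource:
    if item[0] not in seen:
        seen[item[0]] = item

# Only the first row for each ID survives
print(list(seen.values()))  # [('a', 1), ('b', 2)]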
Upvotes: 1
Reputation: 1381
Instead of using a list, you can use a binary search tree ordered by the ID.
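Python's standard library has no binary search tree, so you would need to implement one (or use a third-party package). A minimal, unbalanced sketch, just to illustrate the idea:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

class Tree:
    def __init__(self):
        self.root = None

    def add(self, key):
        """Insert key; return True if it was new, False if already seen."""
        if self.root is None:
            self.root = Node(key)
            return True
        node = self.root
        while True:
            if key == node.key:
                return False  # duplicate ID, caller should skip it
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(key))
                return True
            node = child

seen = Tree()
for item in datasource:
    if seen.add(item[0]):
        pass  # process the first instance of this ID here

Keep in mind that without balancing, lookups are O(log n) only on average and degrade to O(n) if the IDs happen to arrive in sorted order.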
Upvotes: 0