Reputation: 23
I have 4 lists of about 150M lines each, and I need to filter them by removing duplicates based on list one.
lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']
magical code that gives me this:
lst1=['1','2','3']
lst2=['a','b','d']
lst3=['a','b','d']
lst4=['a','b','d']
I tried to:
set1 = list(set(lst1))  # note: set() loses the original order
newlst2 = []
newlst3 = []
newlst4 = []
for i in set1:
    newlst2.append(lst2[lst1.index(i)])
    newlst3.append(lst3[lst1.index(i)])
    newlst4.append(lst4[lst1.index(i)])
The problem is that this takes forever due to the huge lists I am using. Is there any way to optimize this?
I apologize for the archaic way of coding, but I am a life-sciences scientist :)
EDIT for clarification: the lists are not independent; lst1[1], lst2[1], lst3[1] and lst4[1] are 4 measurements of the same "thing". lst1 is a unique identifier that must only appear once, hence the need to remove duplicates and extend that removal to the other lists. I.e., removing the duplicated '1' from lst1 should lead to the removal of "c" from lst2, lst3 and lst4, because they are at the same position: lst1[2] is a duplicate, so lst1[2], lst2[2], lst3[2] and lst4[2] are all removed.
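To make the intended behaviour concrete, here is a minimal single-pass sketch (not code from the thread; it assumes the four lists have equal length) that keeps each row only the first time its identifier appears:
seen = set()
new1, new2, new3, new4 = [], [], [], []
for a, b, c, d in zip(lst1, lst2, lst3, lst4):
    if a not in seen:  # first occurrence of this identifier: keep the whole row
        seen.add(a)
        new1.append(a)
        new2.append(b)
        new3.append(c)
        new4.append(d)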
Picking up on Chris_Rands' answer:
from collections import OrderedDict
from operator import itemgetter

def filter_lists(master_lst, lst2, lst3, lst4):
    unique = list(OrderedDict.fromkeys(master_lst))
    unique_idx = [master_lst.index(item) for item in unique]
    for lst in (master_lst, lst2, lst3, lst4):
        yield list(itemgetter(*unique_idx)(lst))
lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']
print(list(filter_lists(lst1, lst2, lst3, lst4)))
# [['1', '2', '3'], ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]
The slow part of the process is still indexing every element of unique against master_lst. Since OrderedDict retains the initial order of the list, I used this instead of index:
unique_idx = []
total_counts = len(unique)
master_count = 0
unique_count = 0
while total_counts > 0:
    if unique[unique_count] == master_lst[master_count]:
        unique_idx.append(master_count)
        unique_count = unique_count + 1
        total_counts = total_counts - 1
    master_count = master_count + 1
It seems faster. Wow, creating unique_idx went from several hours to a couple of seconds!
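For reference, the same single-pass idea can be written more idiomatically with enumerate and a set; this sketch should produce the same unique_idx as the while loop above:
seen = set()
unique_idx = []
for i, item in enumerate(master_lst):
    if item not in seen:  # record only the first occurrence of each item
        seen.add(item)
        unique_idx.append(i)
Either way, master_lst is traversed exactly once, which is why this is so much faster than calling master_lst.index() once per unique item (each call rescans the list from the start).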
Upvotes: 2
Views: 106
Reputation: 41168
One approach would be to uniquify your list (while preserving the order) using OrderedDict.fromkeys(), then get the unique indexes using list.index, then simply iterate over all the lists and extract the elements at those indexes using itemgetter():
from collections import OrderedDict
from operator import itemgetter

def filter_lists(master_lst, lst2, lst3, lst4):
    unique = list(OrderedDict.fromkeys(master_lst))
    unique_idx = [master_lst.index(item) for item in unique]
    for lst in (master_lst, lst2, lst3, lst4):
        yield list(itemgetter(*unique_idx)(lst))
lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']
print(list(filter_lists(lst1, lst2, lst3, lst4)))
# [['1', '2', '3'], ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]
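Note that the master_lst.index(item) comprehension rescans the list from the beginning for every unique item, which is what the asker later optimized in the edit above. One alternative sketch (assuming Python 3.7+, where plain dicts preserve insertion order; use OrderedDict on older versions) builds the first-occurrence indexes in a single pass:
first_idx = {}
for i, item in enumerate(master_lst):
    first_idx.setdefault(item, i)  # keeps the index of the first occurrence only
unique_idx = list(first_idx.values())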
Upvotes: 4