snurfx

Reputation: 23

Filtering several lists based on a reference list

I have 4 lists of about 150M lines each, and I need to filter all of them according to the duplicates in list one.

lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']

Magical code that gives me this:

lst1=['1','2','3']
lst2=['a','b','d']
lst3=['a','b','d']
lst4=['a','b','d']

I tried this:

    set1 = list(set(lst1))   # note: set() does not preserve the original order
    newlst2 = []
    newlst3 = []
    newlst4 = []
    for i in set1:
        newlst2.append(lst2[lst1.index(i)])   # lst1.index() rescans the whole list every time
        newlst3.append(lst3[lst1.index(i)])
        newlst4.append(lst4[lst1.index(i)])

The problem is that this takes forever due to the huge lists I am using. Is there any way to optimize it?

I apologize for the archaic way of coding, but I am a life-sciences scientist :)

EDIT for clarification: the lists are not independent; lst1[1], lst2[1], lst3[1] and lst4[1] are 4 measurements of the same "thing". lst1 holds a unique identifier that must only appear once, hence the need to remove duplicates and propagate that removal to the other lists, i.e. removing the duplicated '1' from lst1 should lead to the removal of 'c' from lst2, lst3 and lst4, because they are at the same position. In the example above lst1[2] (the second '1') is the duplicate, so the elements at index 2 of lst1, lst2, lst3 and lst4 are all removed.
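For concreteness, here is a minimal single-pass sketch of that behaviour (an illustrative addition, not code from the original post; the function name dedupe_parallel is made up): it keeps only the first occurrence of each identifier and drops the matching positions from every parallel list, using a plain set for O(1) membership tests.

def dedupe_parallel(ids, *other_lists):
    # Record the positions of first occurrences in the identifier list,
    # then keep only those positions in every list.
    seen = set()
    keep = []                      # indexes of first occurrences, in order
    for pos, key in enumerate(ids):
        if key not in seen:
            seen.add(key)
            keep.append(pos)
    return [[lst[i] for i in keep] for lst in (ids, *other_lists)]

lst1 = ['1', '2', '1', '3']
lst2 = ['a', 'b', 'c', 'd']
lst3 = ['a', 'b', 'c', 'd']
lst4 = ['a', 'b', 'c', 'd']

print(dedupe_parallel(lst1, lst2, lst3, lst4))
# [['1', '2', '3'], ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]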

Picking up on Chris_Rands' answer:

from collections import OrderedDict
from operator import itemgetter

def filter_lists(master_lst, lst2, lst3, lst4):
    unique = list(OrderedDict.fromkeys(master_lst))
    unique_idx = [master_lst.index(item) for item in unique]
    for lst in (master_lst, lst2, lst3, lst4):
        yield list(itemgetter(*unique_idx)(lst))

lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']

print(list(filter_lists(lst1, lst2, lst3, lst4)))
# [['1', '2', '3'], ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]

The slow part of the process is still looking up the index of every element of unique in master_lst. Since OrderedDict retains the initial order of the list, I used this instead of index:

unique_idx = []
total_counts = len(unique)   # unique identifiers still to be matched
master_count = 0             # position in master_lst
unique_count = 0             # position in unique

# unique holds the first occurrences in their original order, so a single
# forward pass over master_lst finds the index of each one in turn
while total_counts > 0:
    if unique[unique_count] == master_lst[master_count]:
        unique_idx.append(master_count)
        unique_count = unique_count + 1
        total_counts = total_counts - 1
    master_count = master_count + 1
It seems faster. Wow, creating unique_idx went from several hours to a couple of seconds!
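The same single pass can be written more idiomatically with enumerate() and a set (an equivalent sketch, not from the original post); it also finds the first-occurrence indexes in one O(n) scan, without the manual counters:

master_lst = ['1', '2', '1', '3']

seen = set()
unique_idx = []
for idx, key in enumerate(master_lst):
    if key not in seen:        # first time this identifier appears
        seen.add(key)
        unique_idx.append(idx)

print(unique_idx)  # [0, 1, 3]

With this variant the OrderedDict step is no longer needed at all, since seen already tracks which identifiers have appeared.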

Upvotes: 2

Views: 106

Answers (1)

Chris_Rands

Reputation: 41168

One approach would be to unique-ify your list (while preserving the order) using OrderedDict.fromkeys(), then get the unique indexes using list.index(), then simply iterate over all the lists and extract the elements at those indexes using itemgetter():

from collections import OrderedDict
from operator import itemgetter

def filter_lists(master_lst, lst2, lst3, lst4):
    unique = list(OrderedDict.fromkeys(master_lst))
    unique_idx = [master_lst.index(item) for item in unique]
    for lst in (master_lst, lst2, lst3, lst4):
        yield list(itemgetter(*unique_idx)(lst))

lst1=['1','2','1','3']
lst2=['a','b','c','d']
lst3=['a','b','c','d']
lst4=['a','b','c','d']

print(list(filter_lists(lst1, lst2, lst3, lst4)))
# [['1', '2', '3'], ['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'b', 'd']]

Upvotes: 4
