Bade

Reputation: 747

Remove duplicate entries in a list using Python

I have a big file which I open in Python as:

 import csv

 fh_in = open('/xzy/abc', 'r')
 parsed_in = csv.reader(fh_in, delimiter=',')
 for element in parsed_in:
     print(element)

RESULT:

['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU']

['DEF', 'chr9', '14855289', 'NAME19', 'UCG', 'GUC']

['TTC', 'chr9', '793946', 'NAME178', 'CAG', 'GUC']

['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU']

I have to extract only the unique entries, removing rows that share the same values in col1, col2 and col3. In this case the last line duplicates line 1 on the basis of those three columns.

I have tried two methods, but both failed:

Method 1:

outlist=[]

for element in parsed_in:     
  if element[0:3] not in outlist[0:3]:
    outlist.append(element)

Method 2:

outlist=[]
parsed_list=list(parsed_in)
for element in range(0,len(parsed_list)):
  if parsed_list[element] not in parsed_list[element+1:]:
    outlist.append(parsed_list[element])

Both of these give back all the entries, not the entries that are unique on the basis of the first three columns.

Please suggest a way to do this.

AK

Upvotes: 1

Views: 2324

Answers (2)

Crast

Reputation: 16316

You probably want an O(1) lookup to save yourself a full scan of the elements on every addition, and as Caol Acain said, sets are a good way to do that.

What you want to do is something like:

outlist = []
added_keys = set()

for row in parsed_in:
    # We use tuples because they are hashable (lists are not)
    lookup = tuple(row[:3])
    if lookup not in added_keys:
        outlist.append(row)
        added_keys.add(lookup)

You could alternatively use a dictionary mapping the key to the row, but that has the caveat that you would not preserve the ordering of the input; keeping the list alongside the key set lets you retain the in-file order.
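As a sketch of that dictionary alternative (using sample rows from the question in place of the real file, since the path is not available here):

```python
import csv
from io import StringIO

# Stand-in for the real file from the question
data = """ABC,chr9,3468582,NAME1,UGA,GGU
DEF,chr9,14855289,NAME19,UCG,GUC
TTC,chr9,793946,NAME178,CAG,GUC
ABC,chr9,3468582,NAME272,UGT,GCU"""

parsed_in = csv.reader(StringIO(data), delimiter=',')

# Map each 3-column key to the first row that produced it;
# later rows with the same key are skipped.
seen = {}
for row in parsed_in:
    key = tuple(row[:3])
    if key not in seen:
        seen[key] = row

outlist = list(seen.values())
print(outlist)
```

Note that on CPython 3.7+ plain dicts do preserve insertion order, so the ordering caveat above mainly applies to older Python versions.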

Upvotes: 3

Caol Acain

Reputation: 1

Convert your lists to sets!

http://docs.python.org/tutorial/datastructures.html#sets
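A minimal illustration of the idea: rows (lists) are not hashable, so convert the key columns to a tuple before putting them in a set, and the set then collapses the duplicates.

```python
rows = [
    ['ABC', 'chr9', '3468582', 'NAME1', 'UGA', 'GGU'],
    ['ABC', 'chr9', '3468582', 'NAME272', 'UGT', 'GCU'],
]

# Lists are unhashable, but tuples can go into a set;
# both rows share the same first three columns, so only one key remains.
keys = {tuple(row[:3]) for row in rows}
print(keys)
```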

Upvotes: 0
