Reputation: 51
I am a molecular biologist using Biopython to analyze mutations in genes and my problem is this:
I have a file containing many different sequences (millions), most of which are duplicates. I need to find the duplicates and discard them, keeping one copy of each unique sequence. I was planning on using the module editdist to calculate the edit distance between them all to determine which ones the duplicates are, but editdist can only work with 2 strings, not files.
Anyone know how I can use that module with files instead of strings?
Upvotes: 3
Views: 1362
Reputation: 442
Does it have to be Python?
If the sequences are simply text strings one per line then a shell script will be very efficient:
sort input-file-name | uniq > output-file-name
This will do the job on files up to 2GB on 32 bit Linux.
If you are on Windows then install the GNU utils http://gnuwin32.sourceforge.net/summary.html.
Upvotes: 1
Reputation: 1225
Assuming your file consists solely of sequences arranged one sequence per line, I would suggest the following:
seq_file = open(#your file)
sequences = [seq for seq in seq_file]
uniques = list(set(sequences))
Assuming you have the memory for it. How many millions?
ETA:
Was reading the comments above (but don't have comment privs) - assuming the sequence IDs are the same for any duplicates, this will work. If duplicate sequences can different sequence IDs, then would to know which comes first and what is between them in the file.
Upvotes: 2
Reputation: 11614
Don't be afraid of files! ;-)
I'm posting an example by assuming the following:
-
filename = 'sequence.txt'
with open(filename, 'r') as sqfile:
sequences = sqfile.readlines() # now we have a list of strings
#discarding the duplicates:
uniques = list(set(sequences))
That's it - by using pythons set-type we eliminate all duplicates automagically.
if you have the id and the sequence in the same line like:
423401 ttacguactg
you may want to eliminate the ids like:
sequences = [s.strip().split()[-1] for s in sequences]
with strip we strip the string from leading and trailing whitespaces and with split we split the line/string into 2 components: the id, and the sequence. with the [-1] we select the last component (= the sequence-string) and repack it into our sequence-list.
Upvotes: 0
Reputation: 2909
Four things come to mind:
Upvotes: 0
Reputation: 7317
If you want to filter out exact duplicates, you can use the set
Python built-in type. As an example :
a = ["tccggatcc", "actcctgct", "tccggatcc"] # You have a list of sequences
s = set(a) # Put that into a set
s
is then equal to ['tccggatcc', 'actcctgct']
, without duplicates.
Upvotes: 1