user1513202

Reputation: 51

Python edit distance

I am a molecular biologist using Biopython to analyze mutations in genes and my problem is this:

I have a file containing many different sequences (millions), most of which are duplicates. I need to find the duplicates and discard them, keeping one copy of each unique sequence. I was planning on using the module editdist to calculate the edit distance between them all and determine which ones are duplicates, but editdist only works on two strings, not files.

Anyone know how I can use that module with files instead of strings?
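
For reference, this is the kind of pairwise comparison I had in mind, assuming one sequence per line and that editdist exposes a distance(a, b) function (the file name is just a placeholder):

import editdist  # third-party module; assumed to expose distance(a, b)

with open('sequences.txt') as f:
    sequences = [line.strip() for line in f]

# compare every pair; a distance of 0 means the two sequences are identical
for i in range(len(sequences)):
    for j in range(i + 1, len(sequences)):
        if editdist.distance(sequences[i], sequences[j]) == 0:
            print(sequences[j] + " is a duplicate of " + sequences[i])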

Upvotes: 3

Views: 1362

Answers (5)

Kevin Whitefoot

Reputation: 442

Does it have to be Python?

If the sequences are simply text strings, one per line, then a shell script will be very efficient:

sort input-file-name | uniq > output-file-name

This will do the job on files up to 2GB on 32 bit Linux.
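
If your sort supports it (GNU sort does), the -u flag does the uniq step for you:

sort -u input-file-name > output-file-name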

If you are on Windows then install the GNU utils http://gnuwin32.sourceforge.net/summary.html.

Upvotes: 1

selllikesybok

Reputation: 1225

Assuming your file consists solely of sequences arranged one sequence per line, I would suggest the following:

seq_file = open("sequences.txt")   # substitute your file name
sequences = [seq for seq in seq_file]
uniques = list(set(sequences))

Assuming you have the memory for it. How many millions?

ETA:

Was reading the comments above (but don't have comment privs) - assuming the sequence IDs are the same for any duplicates, this will work. If duplicate sequences can have different sequence IDs, then you would need to know which one comes first and what lies between them in the file.
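
In that case, a minimal sketch that keeps the first ID seen for each unique sequence, assuming each line holds an ID followed by the sequence (the file name is a placeholder):

first_seen = {}                     # sequence -> first ID that carried it
with open('sequences.txt') as seq_file:
    for line in seq_file:
        parts = line.split()
        if len(parts) < 2:
            continue
        seq_id, seq = parts[0], parts[-1]
        if seq not in first_seen:
            first_seen[seq] = seq_id

uniques = [(seq_id, seq) for seq, seq_id in first_seen.items()]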

Upvotes: 2

Don Question

Reputation: 11614

Don't be afraid of files! ;-)

I'm posting an example, assuming the following:

  1. it's a text file
  2. one sequence per line


filename = 'sequence.txt'
with open(filename, 'r') as sqfile:
    sequences = sqfile.readlines()  # now we have a list of strings

#discarding the duplicates:
uniques = list(set(sequences))

That's it - by using Python's set type we eliminate all duplicates automagically.

If you have the ID and the sequence on the same line, like:

423401 ttacguactg

you may want to eliminate the IDs like:

sequences = [s.strip().split()[-1] for s in sequences]

With strip we remove leading and trailing whitespace from the string, and with split we break the line into two components: the ID and the sequence. With [-1] we select the last component (the sequence string) and repack it into our sequence list.

Upvotes: 0

user1277476

Reputation: 2909

Four things come to mind:

  1. You can use a set(), as described by F.X. - assuming the unique strings will all fit in memory
  2. You can use one file per sequence, and feed the files to a program like equivs3e: http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html#python-3e
  3. You could perhaps use gdbm as a set, instead of in its usual key-value role. This is good if you need something that's 100% accurate, but you have too much data to fit all the uniques in virtual memory (see the sketch after this list).
  4. You could perhaps use a bloom filter to cut down the data to more manageable sizes, if you have a truly huge number of strings to check and lots of duplicates. Basically a bloom filter can say "This is definitely not in the set" or "This is almost definitely in the set". In this way, you can eliminate most of the obvious duplicates before using a more common means to operate on the remaining elements. http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
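
A minimal sketch of option 3, assuming one sequence per line and using Python's standard library dbm module as an on-disk set (the file names are placeholders):

import dbm

# treat the dbm file as an on-disk set of sequences already seen
seen = dbm.open('seen_sequences.db', 'c')
with open('sequences.txt') as infile, open('unique_sequences.txt', 'w') as outfile:
    for line in infile:
        seq = line.strip()
        if not seq:
            continue
        key = seq.encode('utf-8')        # dbm keys are bytes
        if key not in seen:
            seen[key] = b''              # value is irrelevant; only membership matters
            outfile.write(seq + '\n')
seen.close()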

Upvotes: 0

F.X.

Reputation: 7317

If you want to filter out exact duplicates, you can use Python's built-in set type. As an example:

a = ["tccggatcc", "actcctgct", "tccggatcc"] # You have a list of sequences
s = set(a) # Put that into a set

s is then equal to {'tccggatcc', 'actcctgct'}, without duplicates.
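
To apply the same idea to a whole file, a minimal sketch, assuming one sequence per line (the file names are placeholders):

with open('sequences.txt') as infile:
    unique_seqs = set(line.strip() for line in infile)

with open('unique_sequences.txt', 'w') as outfile:
    for seq in unique_seqs:
        outfile.write(seq + '\n')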

Upvotes: 1
