Reputation: 21
I want to iterate over the kmers list and select only the items that consist entirely of the characters A, T, G and C.
kmers=["AL","AT","GC","AA","AP"]
DNA_kmers=[]
for kmer in kmers:
    for letter in kmer:
        if letter not in ["A","T","G","C"]:
            pass
        else:
            DNA_kmers.append(kmer)
print("DNA_kmers",DNA_kmers)
output:
DNA_kmers ['AL', 'AT', 'AT', 'GC', 'GC', 'AA', 'AA', 'AP']
desired output:
DNA_kmers=["AT","GC","AA"]
The only method I know is to test for every other letter explicitly:
if "B" in kmer or "D" in kmer or "E" in kmer or "F" in kmer or "H" in kmer or "I" in kmer or "J" in kmer or "K" in kmer or "L" in kmer or "M" in kmer or "N" in kmer or "O" in kmer or "P" in kmer or "Q" in kmer or "R" in kmer or "S" in kmer or "U" in kmer or "V" in kmer or "W" in kmer or "X" in kmer or "Y" in kmer or "Z" in kmer:
    pass
Upvotes: 2
Views: 1153
Reputation: 9132
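Your code currently appends an item once for every character that matches, so any item with at least one valid character is added, and an item with two valid characters is added twice. We can adjust it to add only items where every character matches: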
kmers=["AL","AT","GC","AA","AP"]
DNA_kmers =[]
for kmer in kmers:
    for letter in kmer:
        if letter not in ["A","T","G","C"]:
            break
    else:
        DNA_kmers.append(kmer)
print("DNA_kmers",DNA_kmers)
If you aren't familiar with Python, I've made use of the else clause on the for loop. This isn't available in all languages. The else block runs if and only if the loop completes all of its iterations without hitting a break.
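To see the for/else behavior in isolation, here is a minimal sketch. The helper name is_dna is made up for this illustration and is not part of the code above.

```python
def is_dna(kmer):
    """Return True if every letter of kmer is one of A, T, G, C."""
    for letter in kmer:
        if letter not in "ATGC":
            break  # invalid letter found, so the else block is skipped
    else:
        # Reached only when the loop finished without a break
        return True
    return False


print(is_dna("AT"))  # True
print(is_dna("AL"))  # False
```

Note that an empty string also counts as valid here, because a loop with zero iterations never hits a break.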
There are significantly simpler ways to do what you are trying to do. For example, the following gets the job done with a list comprehension that checks every character of each item using all():
kmers=["AL","AT","GC","AA","AP"]
allowed = set("AGCT")
# A generator expression (no inner list) lets all() stop at the first invalid character
print([k for k in kmers if all(c in allowed for c in k)])
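An equivalent way to express the same check, offered as an alternative sketch rather than something from the original code, is a subset test: an item is valid when the set of its letters is contained in the allowed set. The name dna_kmers is only illustrative.

```python
kmers = ["AL", "AT", "GC", "AA", "AP"]
allowed = set("AGCT")

# set(k) <= allowed is True when every distinct letter of k is in allowed
dna_kmers = [k for k in kmers if set(k) <= allowed]
print(dna_kmers)  # ['AT', 'GC', 'AA']
```

This builds a small set per item, so it trades the early exit of all() for brevity.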
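Another general-purpose solution, which can be faster on long strings because the matching happens inside the regex engine, is to use regular expressions:
import re
kmers=["AL","AT","GC","AA","AP"]
# fullmatch requires the whole string to match, so no ^ and $ anchors are needed
r = re.compile("[ATGC]*")
print([k for k in kmers if r.fullmatch(k)])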
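Whether the regex really is faster depends on the length and number of strings, so it is worth measuring on your own data. The sketch below is one way to compare the two approaches with timeit. The workload size and the function names with_all and with_regex are assumptions chosen for illustration, and the pattern is applied with fullmatch so the whole string has to match. No timings are quoted here because they vary by machine.

```python
import re
import timeit

kmers = ["AL", "AT", "GC", "AA", "AP"] * 1000  # hypothetical workload
allowed = set("AGCT")
pattern = re.compile("[ATGC]*")


def with_all():
    # Character-by-character check in Python code
    return [k for k in kmers if all(c in allowed for c in k)]


def with_regex():
    # Whole-string check delegated to the regex engine
    return [k for k in kmers if pattern.fullmatch(k)]


# Both approaches must agree before their timings mean anything
assert with_all() == with_regex()

for fn in (with_all, with_regex):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds:.4f} s for 100 runs")
```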
If we limit the problem to k-mers where k=2, we can optimize further. The regex should be slightly faster when it matches a fixed-length string, for example with [AGCT]{2}. We can also use itertools.product to build the set of all valid 2-mers, which gives constant-time lookups:
import itertools
kmers=["AL","AT","GC","AA","AP"]
allowed = {a+b for a,b in itertools.product("AGCT", repeat=2)}
print([k for k in kmers if k in allowed])
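The same two ideas can be sketched for an arbitrary fixed length. The function name filter_kmers and its parameter k are hypothetical, introduced only for this example. Keep in mind that the lookup set holds 4**k strings, so this is only practical for small k.

```python
import itertools
import re


def filter_kmers(kmers, k):
    """Keep only the strings that are valid DNA k-mers of length k."""
    # Precompute every possible k-mer over the alphabet AGCT (4**k strings)
    allowed = {"".join(p) for p in itertools.product("AGCT", repeat=k)}
    return [kmer for kmer in kmers if kmer in allowed]


kmers = ["AL", "AT", "GC", "AA", "AP"]
print(filter_kmers(kmers, 2))  # ['AT', 'GC', 'AA']

# Fixed-length regex variant mentioned above, applied with fullmatch
fixed = re.compile("[AGCT]{2}")
print([k for k in kmers if fixed.fullmatch(k)])  # ['AT', 'GC', 'AA']
```

Unlike the open-ended pattern, both variants here also reject strings of the wrong length.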
Upvotes: 2