Reputation: 15
I have the following sequences. As you might guess, I only want the characters A, C, G, T and -, and I want to ignore the rest. The length of the input file may change, and there may be a line such as sequence747, etc.
"sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT"
"sequence2 TTAT---CTAATTGGCGACGATCAAATTTATAAT"
"sequence3 ATAT---CTAATTGGTGATGATCAAATTTATAAT"
"sequence4 ATCA---TTAATTGGAGATGATCAATCCTAATGA"
"sequence5 CTGTACTTTTATTGGTGATAGTCAAATCTATAAT"
So far I have tried this and it works as I wanted:
aligned_lines = []
for line in lines:
    temp_str = ""
    for element in line:
        if element == 'A' or element == 'C' or element == 'G' or element == 'T' or element == '-':
            temp_str += element
    aligned_lines.append(temp_str)
But is there a more efficient way to do it with less code? Is there a built-in Python tool to construct a new string from another string, keeping only certain characters?
Upvotes: 0
Views: 62
Reputation: 73480
One more efficient way would be to re.split (using a regular expression) the string on unwanted chunks and str.join the remainders back together:
import re
pat = re.compile(r"[^ACGT-]+")
aligned_lines = []
for line in lines:
    aligned_lines.append(''.join(pat.split(line)))
    # even better, as suggested: pat.sub('', line)
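The pat.sub variant mentioned in the comment above can be sketched as follows. This is a minimal example, assuming `lines` is a list of strings shaped like the samples in the question; the two sample lines are copied from there.

```python
import re

# Matches one or more characters that are NOT A, C, G, T, or '-'.
# A '-' placed last inside the character class is a literal hyphen.
pat = re.compile(r"[^ACGT-]+")

# Sample input copied from the question (assumed shape: name, space, sequence).
lines = [
    "sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT",
    "sequence5 CTGTACTTTTATTGGTGATAGTCAAATCTATAAT",
]

# Replace every unwanted chunk with the empty string.
aligned_lines = [pat.sub("", line) for line in lines]
print(aligned_lines)
```

Note that this only works because the sequence names here are lowercase; a name containing an uppercase A, C, G, T or a hyphen would leak into the result.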
Some explanation on the regex:
"[^XYZ]" # anything that is NOT X or Y or Z
"X+" # one or more X
"[^ACGT-]+" # one or more of anything not A, C, G, T, or -
Some more notes:
What makes your approach particularly inefficient is the incremental construction of a string. As strings are immutable, temp_str += element in general has to create a new str object, which is an expensive operation (roughly proportional to len(temp_str)). This makes the entire process quadratic. You would already have a linear approach if you collected your elements in a list and joined that list into a str at the end.
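A minimal sketch of that list-and-join idea, keeping the structure of the original loop. The names `allowed` and `kept` are made up for illustration, and the sample line is copied from the question.

```python
# Characters to keep; '-' is included because the question wants gaps preserved.
allowed = "ACGT-"

# Sample input copied from the question.
lines = ["sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT"]

aligned_lines = []
for line in lines:
    kept = []  # collect characters in a list instead of growing a str
    for element in line:
        if element in allowed:
            kept.append(element)
    # Build the final string once per line.
    aligned_lines.append("".join(kept))

print(aligned_lines)
```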
And, of course, your big if-condition can be shortened dramatically, e.g. if element in "ACGT-":. Hence, your approach in an improved version would go along the following lines:
aligned_lines = []
for line in lines:
    aligned_lines.append(''.join(e for e in line if e in "ACGT-"))
This, in turn, could be fancified:
for line in lines:
    aligned_lines.append(''.join(filter("ACGT-".__contains__, line)))
Upvotes: 2
Reputation: 1358
As you'd like to get only the second half of each line, I think a simple str.split() will do it for you:
lines = [
    "sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT",
    "sequence2 TTAT---CTAATTGGCGACGATCAAATTTATAAT",
    "sequence3 ATAT---CTAATTGGTGATGATCAAATTTATAAT",
    "sequence4 ATCA---TTAATTGGAGATGATCAATCCTAATGA",
    "sequence5 CTGTACTTTTATTGGTGATAGTCAAATCTATAAT"]
for line in lines:
    ignore, keep = line.split()
    # The result will be in the keep variable
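To collect every sequence rather than only the last one, the same split can feed a list. This is a sketch under the assumption that each line is exactly a name, some whitespace, and then the sequence; the names `name` and `seq` are illustrative.

```python
# Sample input copied from the question.
lines = [
    "sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT",
    "sequence4 ATCA---TTAATTGGAGATGATCAATCCTAATGA",
]

aligned_lines = []
for line in lines:
    # Split once on whitespace: the name goes left, the sequence goes right.
    name, seq = line.split(maxsplit=1)
    aligned_lines.append(seq)

print(aligned_lines)
```

Unlike the filtering approaches, this keeps whatever follows the name, so any stray characters inside the sequence field would be kept as well.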
Upvotes: 0