Is there an efficient way to select certain characters from a string?

Question

I have the following sequences. As you might guess, I only want the characters A, C, G, T and -, and want to ignore the rest. The length of the input file may vary, and there may be lines such as sequence747, etc.

"sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT"

"sequence2 TTAT---CTAATTGGCGACGATCAAATTTATAAT"

"sequence3 ATAT---CTAATTGGTGATGATCAAATTTATAAT"

"sequence4 ATCA---TTAATTGGAGATGATCAATCCTAATGA"

"sequence5 CTGTACTTTTATTGGTGATAGTCAAATCTATAAT"

So far I have tried this and it works as I wanted:

aligned_lines = []
for line in lines:
    temp_str = ""
    for element in line:
        if element == 'A' or element == 'C' or element == 'G' or element == 'T' or element == '-':
            temp_str += element
    aligned_lines.append(temp_str)

But is there a more efficient way to do it with less code? Is there a built-in Python tool for constructing a new string from only certain characters of another string?

user2390182 · Accepted Answer

One more efficient way would be to re.split the string (using a regular expression) on the unwanted chunks and str.join the remainders back together:

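import re

# Matches one or more characters that are NOT A, C, G, T or '-'
pat = re.compile(r"[^ACGT-]+")

aligned_lines = []
for line in lines:
    # Split on the unwanted chunks and glue the remainders back together
    aligned_lines.append(''.join(pat.split(line)))
    # even better, as suggested: pat.sub('', line)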
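
A minimal, self-contained sketch of the pat.sub variant mentioned in the comment above. The two sample lines are copied from the question; the helper name keep_bases is hypothetical and only there for illustration.

```python
import re

# One or more characters that are NOT A, C, G, T or '-'
PAT = re.compile(r"[^ACGT-]+")

def keep_bases(line):
    """Return line with every character outside A, C, G, T, '-' removed."""
    return PAT.sub("", line)

# Sample input copied from the question
lines = [
    "sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT",
    "sequence5 CTGTACTTTTATTGGTGATAGTCAAATCTATAAT",
]
aligned_lines = [keep_bases(line) for line in lines]
print(aligned_lines)
```

Compiling the pattern once outside the loop avoids looking it up again for every line.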

Some explanation of the regex:

"[^XYZ]"     # any single character that is NOT X, Y or Z
"X+"         # one or more X
"[^ACGT-]+"  # one or more of anything not A, C, G, T or - (a trailing - inside [] is a literal hyphen)

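To see these pieces at work, here is a small sketch of what the split produces on one line from the question before the join. The pattern includes the hyphen, since the gap characters should be kept; the variable names are illustrative only.

```python
import re

pat = re.compile(r"[^ACGT-]+")

line = "sequence1 ATAC---CTAATTGGAGATGATCAAATTTATAAT"

# The unwanted run "sequence1 " sits at the start of the line, so split()
# yields an empty string before it and the wanted remainder after it.
chunks = pat.split(line)
print(chunks)

# Joining the chunks drops the unwanted run entirely.
result = "".join(chunks)
print(result)
```
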
Some more notes:
What makes your approach particularly inefficient is the incremental construction of a string. As strings are immutable, temp_str += element must in general create a new str object, which is an expensive operation (proportional to len(temp_str)). This can make the entire process quadratic in the length of the line. You would already have a linear approach if you collected your elements in a list and joined that list into a str at the end. And, of course, your big if-condition can be shortened dramatically, e.g. if element in "ACGT-":. Hence, an improved version of your approach would go along the following lines:

for line in lines:
    aligned_lines.append(''.join(e for e in line if e in "ACGT-"))

This, in turn, could be fancified:

for line in lines:
    aligned_lines.append(''.join(filter("ACGT-".__contains__, line)))
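
A rough way to check the variants against each other, assuming the '-' gap character should be kept as the question asks. The function names and the synthetic test line are hypothetical; timings depend on the machine and Python version, so none are quoted here.

```python
import re
import timeit

KEEP = "ACGT-"
PAT = re.compile(r"[^ACGT-]+")

def with_concat(line):
    # Original approach: grow a string one character at a time
    out = ""
    for ch in line:
        if ch in KEEP:
            out += ch
    return out

def with_join(line):
    # Generator expression fed to str.join
    return "".join(ch for ch in line if ch in KEEP)

def with_filter(line):
    # filter() with the bound membership test
    return "".join(filter(KEEP.__contains__, line))

def with_regex(line):
    # Regex substitution of every unwanted run
    return PAT.sub("", line)

variants = [with_concat, with_join, with_filter, with_regex]

# Synthetic line: a label followed by a long run of bases (made-up data)
line = "sequence747 " + "ATAC---CTAATTGG" * 200

# All variants should agree before their speed is compared
expected = with_concat(line)
assert all(f(line) == expected for f in variants)

for f in variants:
    seconds = timeit.timeit(lambda: f(line), number=200)
    print(f"{f.__name__}: {seconds:.4f} s")
```

Checking that every variant returns the same string first guards against comparing the speed of functions that do different things.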
