techdog

Reputation: 1481

Use a regular expression to find and replace, but only every 3 characters, for a DNA sequence

Is it possible to do a find/replace using regular expressions on a string of DNA such that it only considers 3 characters (one codon of DNA) at a time?

For example, I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG

If I use a plain regular expression right now, such as
Regex.Replace(dna,"ACC","AAA"), it finds a match (the last A of AAA followed by CC), but when looking at 3 characters at a time there should be no match.
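
To make the problem concrete, here is a minimal sketch of what the plain call does on the sample string. The class and method names (PlainReplaceDemo, PlainReplace) are made up for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlainReplaceDemo
{
    // Plain replacement: the pattern may match at any index,
    // so codon boundaries are ignored.
    public static string PlainReplace(string dna, string codon, string replacement)
    {
        return Regex.Replace(dna, Regex.Escape(codon), replacement);
    }

    public static void Main()
    {
        // "ACC" is found at index 2, straddling the codons AAA and CCC.
        Console.WriteLine(PlainReplace("AAACCCTTTGGG", "ACC", "AAA")); // AAAAACTTTGGG
    }
}
```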

Is this possible?

Upvotes: 2

Views: 701

Answers (3)

nhahtdh

Reputation: 56819

Solution

It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):

Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);


If the input is not guaranteed to be valid, you can check it first with the regex ^[ATCG]*$ (allows a length that is not a multiple of 3) or ^([ATCG]{3})*$ (the length must be a multiple of 3). It doesn't make sense to operate on invalid input anyway.
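
Putting the replacement and the validity check together might look like the sketch below. The names CodonRegex and ReplaceCodon are hypothetical, and the choice to throw ArgumentException on invalid input is an assumption, not something the question requires.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodonRegex
{
    // Hypothetical helper: validates the input, then applies the \G-anchored replacement.
    public static string ReplaceCodon(string input, string codon, string replacement)
    {
        // Validation regex from the text above: whole codons made of A, T, C, G.
        if (!Regex.IsMatch(input, "^([ATCG]{3})*$"))
            throw new ArgumentException("Input must consist of whole A/T/C/G codons.", nameof(input));
        if (!Regex.IsMatch(codon, "^[ATCG]{3}$"))
            throw new ArgumentException("Codon must be 3 letters from A/T/C/G.", nameof(codon));
        if (!Regex.IsMatch(replacement, "^[ATCG]{3}$"))
            throw new ArgumentException("Replacement must be 3 letters from A/T/C/G.", nameof(replacement));

        // $1 puts back the codons that were skipped before the one being replaced.
        return Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceCodon("AAACCCTTTGGG", "CCC", "GGG")); // AAAGGGTTTGGG
        Console.WriteLine(ReplaceCodon("AAACCCTTTGGG", "ACC", "AAA")); // AAACCCTTTGGG (no aligned match)
    }
}
```

Because the codon and replacement are restricted to letters here, neither needs escaping before being concatenated into the pattern or the replacement string.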

Explanation

The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.

Starting where the previous match ended, the whole regex matches the shortest run of whole codons that ends with the codon to be replaced.

\G            # Must be at beginning of the string, or where last match left off
((?:.{3})*?)  # Match any number of codons, lazily. The text is also captured.
AAA           # The codon we want to replace

We make sure that matches only start at positions whose index is a multiple of 3 with:

  • \G, which asserts that the match starts where the previous match left off (or at the beginning of the string)
  • The fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is a multiple of 3.

Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by the ((?:.{3})*?) part) does not contain the codon at any codon-aligned position.

In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
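
To see the pieces described above in action, the following sketch lists each match together with what group 1 captured. The names CodonMatchTrace and Trace, and the sample input, are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodonMatchTrace
{
    // Returns one line per match: start index, captured prefix, and full match text.
    public static string[] Trace(string input, string codon)
    {
        MatchCollection matches = Regex.Matches(input, @"\G((?:.{3})*?)" + codon);
        var lines = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            lines[i] = $"index={m.Index} prefix='{m.Groups[1].Value}' match='{m.Value}'";
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Trace("AAACCCAAATTT", "AAA"))
            Console.WriteLine(line);
        // index=0 prefix='' match='AAA'
        // index=3 prefix='CCC' match='CCCAAA'
    }
}
```

The second match begins at index 3, exactly where the first one ended, and the trailing TTT is never part of a match because no AAA codon follows it.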

Upvotes: 1

Pieter Geerkens

Reputation: 11893

Why use a regex? Try this instead, which is probably more efficient to boot:

using System;
using System.Text;

public string DnaReplaceCodon(string input, string match, string replace) {
  if (match.Length != 3  || replace.Length != 3) 
      throw new ArgumentOutOfRangeException();

  var output = new StringBuilder(input.Length);
  int i = 0;
  // Walk the input one whole codon (3 characters) at a time.
  while (i + 2 < input.Length) {
    if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
      output.Append(replace);
    } else {
      // Copy the codon unchanged. Note input[i+1], not input[i]+1,
      // which would append a number instead of the next character.
      output.Append(input[i]);
      output.Append(input[i+1]);
      output.Append(input[i+2]);
    }

    i += 3;
  }

  // Pick up trailing letters that do not form a full codon.
  while (i < input.Length)   output.Append(input[i++]);

  return output.ToString();
}

Upvotes: 1

luksch

Reputation: 11712

NOTE

As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake.

You can usually find out where a match starts and ends via m.start() and m.end() (in .NET, m.Index and m.Index + m.Length). If m.start() % 3 == 0 you found a relevant match.
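
One way this check can go wrong, as far as I can tell (the comment referred to above is not reproduced here, so this is my own reading): regex matches do not overlap, so a match at a misaligned position can consume characters that belong to an aligned occurrence. The sketch below uses a made-up helper name, AlignedMatchStarts, and a made-up sample input to show the effect.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OffsetFilterDemo
{
    // Start indices of plain matches that happen to fall on a codon boundary.
    public static List<int> AlignedMatchStarts(string dna, string codon)
    {
        var starts = new List<int>();
        foreach (Match m in Regex.Matches(dna, Regex.Escape(codon)))
        {
            if (m.Index % 3 == 0)
                starts.Add(m.Index);
        }
        return starts;
    }

    public static void Main()
    {
        // The codons are CAA AAA, so AAA sits on a boundary at index 3.
        // The scan matches AAA at index 1 first and resumes at index 4,
        // so the aligned occurrence is never reported.
        Console.WriteLine(AlignedMatchStarts("CAAAAA", "AAA").Count); // 0
    }
}
```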

Upvotes: 0
