Reputation: 1481
Is it possible to do a find/replace using regular expressions on a string of DNA such that it only considers 3 characters (one codon) at a time?
For example, I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use a regular expression as-is, for example
Regex.Replace(dna,"ACC","AAA"), it finds a match (the ACC that straddles the first two codons). When looking at 3 characters at a time, there should be no match.
Is this possible?
Upvotes: 2
Views: 701
Reputation: 56819
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
If the input is not guaranteed to be valid, you can check it first with the regex ^[ATCG]*$ (allows a length that is not a multiple of 3) or ^([ATCG]{3})*$ (the length must be a multiple of 3). It doesn't make sense to operate on invalid input anyway.
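The validation step could be sketched as below. This is only an illustration: the class name DnaValidation and the method names IsValidDna and IsWholeCodons are made up here, while the two patterns are the ones given above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are assumptions for illustration.
static class DnaValidation
{
    // True when the input contains only A, T, C, G (any length, including empty).
    public static bool IsValidDna(string input) =>
        Regex.IsMatch(input, "^[ATCG]*$");

    // True when the input contains only A, T, C, G and its length is a multiple of 3.
    public static bool IsWholeCodons(string input) =>
        Regex.IsMatch(input, "^([ATCG]{3})*$");
}
```

Note that in .NET, $ also matches just before a single trailing newline, so ending the pattern with \z instead is the stricter choice if that matters for your input.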
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G            # Must be at the beginning of the string, or where the last match left off
((?:.{3})*?)  # Match any number of codons, lazily. The text is also captured.
AAA           # The codon we want to replace
We make sure that matches only start at positions whose index is a multiple of 3 with two things:
\G, which asserts that the match starts where the previous match left off (or at the beginning of the string).
((?:.{3})*?)AAA, which can only match a sequence whose length is a multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by the ((?:.{3})*?) part) does not contain the codon at a codon boundary.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
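Putting the pieces together, a minimal sketch might look like the following. The class name CodonRegex and the method name ReplaceCodon are invented for this example. It also writes the group reference as ${1} rather than $1 and runs the codon through Regex.Escape; neither is needed for plain A/T/C/G input, they are just defensive choices made here.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper; the names are assumptions for illustration.
static class CodonRegex
{
    // Replaces every in-frame occurrence of `codon` with `replacement`.
    // Assumes the input has already been validated as A/T/C/G only.
    public static string ReplaceCodon(string input, string codon, string replacement)
    {
        // \G anchors each match at the start or at the end of the previous match;
        // the lazy group skips whole codons until the target codon is found.
        string pattern = @"\G((?:.{3})*?)" + Regex.Escape(codon);
        // ${1} puts back the skipped codons in front of the replacement.
        return Regex.Replace(input, pattern, "${1}" + replacement);
    }
}
```

With the string from the question, CodonRegex.ReplaceCodon("AAACCCTTTGGG", "CCC", "GGG") gives AAAGGGTTTGGG, while CodonRegex.ReplaceCodon("AAACCCTTTGGG", "ACC", "AAA") leaves the string unchanged, because ACC only occurs across a codon boundary.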
Upvotes: 1
Reputation: 11893
Why use a regex? Try this instead, which is probably more efficient to boot:
// Requires: using System; using System.Text;
public string DnaReplaceCodon(string input, string match, string replace) {
    if (match.Length != 3 || replace.Length != 3)
        throw new ArgumentOutOfRangeException();
    var output = new StringBuilder(input.Length);
    int i = 0;
    // Walk the input one codon (3 characters) at a time.
    while (i + 2 < input.Length) {
        if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
            output.Append(replace);
        } else {
            output.Append(input[i]);
            output.Append(input[i+1]);
            output.Append(input[i+2]);
        }
        i += 3;
    }
    // Pick up trailing letters (fewer than 3 remain).
    while (i < input.Length) output.Append(input[i++]);
    return output.ToString();
}
Upvotes: 1
Reputation: 11712
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake.
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.
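The comment itself is not reproduced here, but one way this approach can go wrong is easy to sketch: regex matches do not overlap, so an out-of-frame match can consume characters that belong to an in-frame codon. The example below uses .NET's Match.Index in place of m.start(); the class and method names are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are assumptions for illustration.
static class StartIndexFilterDemo
{
    // Counts regex matches whose start index is a multiple of 3
    // (the filtering idea described above).
    public static int CountAlignedMatches(string dna, string codon) =>
        Regex.Matches(dna, Regex.Escape(codon))
             .Cast<Match>()
             .Count(m => m.Index % 3 == 0);

    // Counts occurrences by walking the string one codon at a time.
    public static int CountCodons(string dna, string codon)
    {
        int count = 0;
        for (int i = 0; i + 3 <= dna.Length; i += 3)
            if (string.CompareOrdinal(dna, i, codon, 0, 3) == 0) count++;
        return count;
    }
}
```

For the input CACACA (codons CAC and ACA) and the codon ACA, the only regex match starts at index 1 and swallows the first character of the second codon, so CountAlignedMatches returns 0 while CountCodons returns 1.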
Upvotes: 0