Reputation: 511
Regular expression for comma separated sub-string permutations
Hi,
I would like to define a regular expression that matches strings consisting of two sub-strings separated by a single comma. Each sub-string may not be empty, and consists exclusively and without repetition of the characters 'A', 'G', 'C' and 'T'. Thus, the pattern should match strings such as:
A,G
AG,CT
TC,CA <- correct, 1st and 2nd sub-strings may have characters in common (as long as these are not repeated within the sub-string).
GAT,CGA
CGAT,TG <- correct, sub-strings may be of different length.
etc ...
and should not match:
,G <- missing 1st sub-string
ACGT <- missing comma
X,A <- incorrect character X
AA,G <- repetition of character A in 1st sub-string
AC,GGC <- repetition of character G in 2nd sub-string
ATGA,TGG <- repetition in both sub-strings
ATCXG,AAC <- incorrect character X and repetition in 2nd sub-string
etc ...
So far I have:
/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\1.*,)(?!,.*(.).*\1).*$/
/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,)(?!,.*(.).*\g{1}).*$/
I also tried joining the capture groups with:
/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\g{1}.*,.*(.).*\g{2}).*$/
Now, (?=[ACGT]{1,4},[ACGT]{1,4}$) seems to match the "two sub-strings separated by a single comma" and "consists exclusively of the characters 'A', 'G', 'C' and 'T'" requirements throughout the string.
(?!.*(.).*\1.*,) seems to enforce "without repetition" up to the comma.
However, (?!,.*(.).*\1) does not appear to prevent a match when a character is repeated after the comma.
I'd greatly appreciate replies with clues and/or patterns that help with the desired matching.
Using perl v5.18.2
Thanks in advance
Robert
Upvotes: 1
Views: 728
Reputation:
I think you are pretty close. This should work as well; it basically does what @Miller's answer does.
Updated: a condensed version.
# /(?m)^(?:(?:^|,)(?:([AGCT])(?![AGCT]*\1)){1,4}){2}$/

(?m)                         # Multiline mode
^                            # BOL
(?:                          # Total cluster
    (?: ^ | , )              # BOL or comma
    (?:                      # AGCT Cluster grp
        ( [AGCT] )           # (1), Capture single character [AGCT]
        (?!                  # Negative lookahead
            [AGCT]*          # As many [AGCT] needed
            \1               # to find what is captured in group 1
        )                    # End Negative lookahead
    ){1,4}                   # End AGCT Cluster grp 1-4 characters
){2}                         # Total cluster, do 2 times
$                            # EOL
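
To try the pattern out, here is a minimal sketch. It assumes the expanded form is compiled with the /x flag (so the layout whitespace and comments are ignored) and uses the sample strings from the question; the names $pair, @samples and @matched are only for illustration.

```perl
use strict;
use warnings;

# The condensed pattern, laid out with the x flag; m makes ^ and $ line anchors.
my $pair = qr/
    ^
    (?:
        (?: ^ | , )              # start of line or the separating comma
        (?:
            ( [AGCT] )           # capture one allowed character
            (?! [AGCT]* \1 )     # reject it if it recurs later in this sub-string
        ){1,4}                   # one to four characters per sub-string
    ){2}                         # exactly two sub-strings
    $
/mx;

# Sample strings from the question: the first five should match, the rest should not.
my @samples = qw(A,G AG,CT TC,CA GAT,CGA CGAT,TG
                 ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC);

my @matched = grep { $_ =~ $pair } @samples;
print "$_\n" for @matched;
```

Because the lookahead only scans [AGCT]*, it stops at the comma, which is why a character may appear in both sub-strings but not twice in the same one.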
Upvotes: 1
Reputation: 35198
Break your problem into steps.
First look for allowed format and characters. Then check to make sure there is no repetition.
use strict;
use warnings;
while (<DATA>) {
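    # Step 1: check the overall shape and the allowed characters.
    # Step 2: reject any character repeated within a sub-string
    #         (\w never matches the comma, so the check cannot cross it).
    print if /^[ACGT]+,[ACGT]+$/ && !/(\w)\w*\1/;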
}
__DATA__
A,G
AG,CT
TC,CA
GAT,CGA
CGAT,TG
,G
ACGT
X,A
AA,G
AC,GGC
ATGA,TGG
ATCXG,AAC
Outputs:
A,G
AG,CT
TC,CA
GAT,CGA
CGAT,TG
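
For comparison, the single-pattern attempt from the question can probably be repaired too. This is only a sketch and is not taken from either answer. It assumes the diagnosis is that (?!,.*(.).*\1) is tested at the start of the string, where the next character is never a comma for well-formed input, and that its capture is group 2 rather than group 1. The names $fixed, @samples and @matched are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical repair of the question's pattern:
#   - the last lookahead now starts with .*, so it can reach the comma
#   - its backreference is \2, because its capture is the second group
my $fixed = qr/^(?=[ACGT]{1,4},[ACGT]{1,4}$)(?!.*(.).*\1.*,)(?!.*,.*(.).*\2).*$/;

# Sample strings from the question.
my @samples = qw(A,G AG,CT TC,CA GAT,CGA CGAT,TG
                 ,G ACGT X,A AA,G AC,GGC ATGA,TGG ATCXG,AAC);

my @matched = grep { $_ =~ $fixed } @samples;
print "$_\n" for @matched;
```

The two-step check above is still easier to read and maintain; the single pattern is mainly useful when only one regex can be supplied.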
Upvotes: 4