Reputation: 51
I am trying to write a regular expression to match strings made up of x, y or z, where at most two of the three letters may appear in any one string.
For example:
valid strings = xxxx, xxxyyyy, xyxyx, zyzzzyyy, xzzzxx
invalid strings = xyz, xxxyyyyz, zxzyy
I initially wrote it as follows:
regex = re.compile("((x*y*)*)|((x*z*)*)|(y*z*)*)")
My logic was that it would first test for strings made of x and y, then x and z, then y and z. Unfortunately this does not work. It matches my first test string, xyxyxyxyx, but it does not match my second string, zyzyzyzy. Am I using the vertical "or" bars the wrong way?
Upvotes: 1
Views: 1310
Reputation: 385897
If the string can only contain the characters x, y and z:
^([xy]*|[xz]*|[yz]*)$
If the string can contain characters other than x, y and z:
^(?:[^x]+|[^y]+|[^z]+)?$
Partially optimized:
^[^xyz]*(?:[^x]+|[^y]+|[^z]+)?$
Optimized: (written with free-spacing whitespace, so compile it with re.VERBOSE)
^
[^xyz]*
(?: x [^yz]* (?: y [^z]* | z [^y]* )?
| y [^xz]* (?: x [^z]* | z [^x]* )?
| z [^xy]* (?: x [^y]* | y [^x]* )?
)?
$
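Fully optimized: (uses possessive quantifiers, so it requires the third-party regex module, or re on Python 3.11 and later; also free-spacing, so use the VERBOSE flag)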
^
[^xyz]*+
(?: x [^yz]*+ (?: y [^z]*+ | z [^y]*+ )?+
| y [^xz]*+ (?: x [^z]*+ | z [^x]*+ )?+
| z [^xy]*+ (?: x [^y]*+ | y [^x]*+ )?+
)?+
$
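As a quick sanity check, here is a minimal sketch that compares the simple pattern with the unrolled "Optimized" one on a few sample strings. The names SIMPLE, UNROLLED and lacks_one_of_xyz, and the sample strings themselves, are made up for illustration. The unrolled pattern is compiled with re.VERBOSE because it relies on insignificant whitespace. The possessive variant is not exercised here.

```python
import re

# Simple form: the whole string avoids at least one of x, y, z.
SIMPLE = re.compile(r"^(?:[^x]+|[^y]+|[^z]+)?$")

# Unrolled form from the answer; re.VERBOSE makes the layout whitespace insignificant.
UNROLLED = re.compile(r"""
    ^
    [^xyz]*
    (?: x [^yz]* (?: y [^z]* | z [^y]* )?
      | y [^xz]* (?: x [^z]* | z [^x]* )?
      | z [^xy]* (?: x [^y]* | y [^x]* )?
    )?
    $
""", re.VERBOSE)


def lacks_one_of_xyz(s, pattern=SIMPLE):
    """Return True if s contains at most two of the letters x, y, z."""
    return pattern.match(s) is not None


# Hypothetical sample inputs, some with characters other than x, y, z.
samples = ["", "abc", "axbyc", "x-y-x", "zzz yyy", "xyz", "a x b y c z", "zxzyy"]
for s in samples:
    # Both patterns should agree on every sample.
    assert lacks_one_of_xyz(s, SIMPLE) == lacks_one_of_xyz(s, UNROLLED)
    print(repr(s), lacks_one_of_xyz(s))
```

Both forms accept the same strings here; the unrolled one is intended to reduce backtracking on long inputs.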
Upvotes: 0
Reputation: 59111
I'm not sure quite how you came up with what you've got, but if you want to match a sequence of (only xs and ys) or (only xs and zs) or (only ys and zs) you can use an expression like this:
^([xy]*|[xz]*|[yz]*)$
Character classes (square brackets) are a convenient way to specify "any one of these characters". So [xy]* means "a sequence of any length composed of only x and y characters".
The ^ and $ (start and end) indicate that the pattern should match your entire string.
Additionally, if you want to prevent "" (the empty string) being matched, you could replace all the * with +.
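A minimal sketch of this pattern run against the example strings from the question. The names TWO_OF_XYZ, NON_EMPTY and is_valid are made up for illustration.

```python
import re

# Anchored alternation of the three two-letter character classes.
TWO_OF_XYZ = re.compile(r"^([xy]*|[xz]*|[yz]*)$")

# Same idea with + instead of *, so the empty string is rejected.
NON_EMPTY = re.compile(r"^([xy]+|[xz]+|[yz]+)$")

valid = ["xxxx", "xxxyyyy", "xyxyx", "zyzzzyyy", "xzzzxx"]
invalid = ["xyz", "xxxyyyyz", "zxzyy"]


def is_valid(s, pattern=TWO_OF_XYZ):
    """Return True if s uses at most two of the letters x, y, z."""
    return pattern.match(s) is not None


for s in valid + invalid:
    print(s, is_valid(s))
```

With the * version the empty string also counts as valid; with the + version it does not.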
Upvotes: 1
Reputation: 103884
You need word-boundary assertions (\b) at the start and end, and then alternation (|) between the three different character classes:
\b([xy]+|[zy]+|[xz]+)\b
You can also use a simpler, faster regex, \b[xyz]+\b, and combine it with Python logic:
[w for w in re.findall(r'\b[xyz]+\b', txt) if len(set(w))<=2]
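A self-contained sketch of that approach, assuming the input is whitespace-separated text. The function name words_with_at_most_two_letters and the sample text are made up for illustration.

```python
import re


def words_with_at_most_two_letters(txt):
    """Find words made only of x, y, z, then keep those using at most two distinct letters."""
    candidates = re.findall(r"\b[xyz]+\b", txt)
    return [w for w in candidates if len(set(w)) <= 2]


txt = "xxxx xyz xyxyx zxzyy zyzzzyyy"
print(words_with_at_most_two_letters(txt))  # ['xxxx', 'xyxyx', 'zyzzzyyy']
```

The regex only finds candidate words; the set() check does the counting of distinct letters, which keeps the pattern itself trivial.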
Upvotes: 0
Reputation: 18611
Use a lookahead to make sure any string containing three (or more) different characters is failed:
^(?!.*(.).*(?!\1)(.).*(?!\1|\2).)[xyz]+$
Python:
regex = r"^(?!.*(.).*(?!\1)(.).*(?!\1|\2).)[xyz]+$"
Explanation
--------------------------------------------------------------------------------
^           the beginning of the string
(?!         look ahead to see if there is not:
  .*          any character except \n (0 or more times, matching the most amount possible)
  (           group and capture to \1:
    .           any character except \n
  )           end of \1
  .*          any character except \n (0 or more times, matching the most amount possible)
  (?!         look ahead to see if there is not:
    \1          what was matched by capture \1
  )           end of look-ahead
  (           group and capture to \2:
    .           any character except \n
  )           end of \2
  .*          any character except \n (0 or more times, matching the most amount possible)
  (?!         look ahead to see if there is not:
    \1          what was matched by capture \1
    |           OR
    \2          what was matched by capture \2
  )           end of look-ahead
  .           any character except \n
)           end of look-ahead
[xyz]+      any character of: 'x', 'y', 'z' (1 or more times, matching the most amount possible)
$           before an optional \n, and the end of the string
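A minimal sketch that runs this lookahead pattern against the example strings from the question. The names AT_MOST_TWO and is_valid are made up for illustration.

```python
import re

# The negative lookahead rejects any string in which three mutually different
# characters can be found in order; [xyz]+ then restricts the alphabet.
AT_MOST_TWO = re.compile(r"^(?!.*(.).*(?!\1)(.).*(?!\1|\2).)[xyz]+$")

valid = ["xxxx", "xxxyyyy", "xyxyx", "zyzzzyyy", "xzzzxx"]
invalid = ["xyz", "xxxyyyyz", "zxzyy"]


def is_valid(s):
    """Return True if s is a non-empty string over x, y, z using at most two distinct letters."""
    return AT_MOST_TWO.match(s) is not None


for s in valid + invalid:
    print(s, is_valid(s))
```

Because of the final [xyz]+, this version rejects the empty string and any string containing other characters.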
Upvotes: 1