Reputation: 43
I am trying to create a regular expression that accepts almost every character on an American keyboard except for a select few. This is what I currently have (not everything is included):
^[a-zA-Z0-9!~`@#$%\\^]
Now I know the ^ is the first character I have come across that needs an escape in front of it. When I put a single \ in front of it, I get a compilation error (invalid escape sequence). When I run this against a String, it completely ignores the ^ rule. Does anyone know what I'm doing wrong?
Upvotes: 4
Views: 3531
Reputation: 3333
You only need to escape ^ when you want to match it literally, that is, when you want to look for text containing the ^ character.
If you intend to use the ^ with its special meaning (the start of a line/string) then there is no need to escape it. Simply type
"^[a-zA-Z0-9!~`@#$%\\^]"
in your source code. The backslashes towards the end of this regular expression do not change what it matches. You need to type two backslashes because the backslash has a special meaning in Java string literals, but that has nothing to do with its treatment in regular expressions. The regular expression engine receives a single backslash, which it uses to read the following character as a literal, but a ^ that is not the first character inside the brackets is a literal anyway.
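The two layers of escaping described above can be checked directly. The following is a minimal sketch, not part of the original answer; the class name `CaretEscapeDemo` and its members are hypothetical and exist only for illustration. It compares a character class with an escaped caret against the same class without the escape.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CaretEscapeDemo {
    // Caret escaped inside the class: the regex engine sees [%\^]
    static final Pattern ESCAPED = Pattern.compile("[%\\^]");
    // Same class without the escape: the regex engine sees [%^]
    static final Pattern UNESCAPED = Pattern.compile("[%^]");

    // True if both patterns agree on whether the whole input matches.
    static boolean sameResult(String input) {
        return ESCAPED.matcher(input).matches() == UNESCAPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The Java literal "\\^" is a two-character string: backslash, caret.
        System.out.println("\\^".length()); // 2
        for (String s : new String[] {"^", "%", "\\", "a"}) {
            System.out.println(s + " -> escaped: " + ESCAPED.matcher(s).matches()
                    + ", unescaped: " + UNESCAPED.matcher(s).matches());
        }
    }
}
```

Both patterns accept "^" and "%" and reject a lone backslash, which suggests that the escaped form does not add the backslash itself to the class.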
To elaborate on your comment about [ and ]:
The brackets have a special meaning in regular expressions: they form the boundaries of the list of characters accepted by the pattern (the listed characters form a so-called character class). Let's decompose the regular expression from above to make things clear.
^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters a to z
A-Z Uppercase letters A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. The regular expression engine only receives a single backslash, as the other one is consumed by Java's syntax for Strings. It marks the following character as a literal, but ^ is a literal at this position in a character class anyway, so the backslash has no effect.
^ Caret, literally
] Closing boundary of your character class
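The order of the entries within the character class definition is irrelevant, except that a ^ placed directly after the opening [ would negate the class. The expression above matches if the first character of the examined text is part of your character class definition. Whether the other characters in the examined text matter depends on how you use the regular expression: String.matches requires the whole text to match, whereas Matcher.find only looks for a match somewhere in it.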
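How much of the text gets examined depends on the method used. The following is a minimal sketch of that difference, assuming the pattern from the second answer; the class name `FirstCharVersusWholeString` and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class FirstCharVersusWholeString {
    static final String REGEX = "^[a-zA-Z0-9!~`@#$%^]";

    // find() looks for a match anywhere; the ^ anchor pins it to the first character.
    static boolean firstCharAllowed(String text) {
        return Pattern.compile(REGEX).matcher(text).find();
    }

    // String.matches requires the pattern to cover the entire input.
    static boolean wholeStringMatches(String text) {
        return text.matches(REGEX);
    }

    public static void main(String[] args) {
        String text = "a*()";
        System.out.println(firstCharAllowed(text));   // true: 'a' is in the class
        System.out.println(wholeStringMatches(text)); // false: the pattern covers one character only
        System.out.println(wholeStringMatches("a"));  // true
    }
}
```

To validate every character rather than only the first, the class needs a quantifier such as + together with a whole-string match.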
When you start with regular expressions, you should always use multiple test texts to match against and verify the behaviour. It is also advisable to turn these test cases into a unit test to gain confidence in the correct behaviour of your program.
A simple code sample to test the expression is as follows:
public class Test {
    public static void main(String[] args) {
        String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
        String[] testdata = new String[] {
            "abc",
            "2332",
            "some@test",
            "test [ and ] test end",
            // Following sample will not match the pattern.
            "äöüßµøł"
        };
        for (String toExamine : testdata) {
            if (toExamine.matches(regexp)) {
                System.out.println("Match: " + toExamine);
            } else {
                System.out.println("No match: " + toExamine);
            }
        }
    }
}
Note that I use a modified pattern here. It ensures that all characters in the examined string match your character class. I also extended the character class to allow \, space, [ and ]. The decomposed description is:
^ Matches the start of the text
[ Opening boundary of your character class
a-z Lowercase letters a to z
A-Z Uppercase letters A to Z
0-9 Digits from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. The regular expression engine only receives 2 backslashes, as every other backslash is consumed by Java's syntax for Strings. The first backslash marks the second backslash as occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket lose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket lose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the end of the text
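The decomposition above, together with the earlier advice to keep several test texts, can be captured in a small table of expectations. This is a minimal sketch in plain Java without a test framework; the class name `CharacterClassChecks`, its methods and the chosen inputs are assumptions made for illustration, not part of the original answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical check class; names and test inputs are illustrative only.
class CharacterClassChecks {
    // Same extended pattern as in the sample above.
    static final String REGEXP = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";

    // Maps each test text to the result we expect from String.matches.
    static Map<String, Boolean> expectations() {
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("abc", true);
        cases.put("back\\slash", true); // contains one literal backslash
        cases.put("[x] ^ y", true);     // brackets, space and caret are in the class
        cases.put("", false);           // + needs at least one character
        cases.put("a*b", false);        // * is not in the class
        cases.put("caf\u00e9", false);  // accented e is not in the class
        return cases;
    }

    // True if every test text behaves as expected.
    static boolean allPass() {
        for (Map.Entry<String, Boolean> e : expectations().entrySet()) {
            if (e.getKey().matches(REGEXP) != e.getValue()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Boolean> e : expectations().entrySet()) {
            boolean actual = e.getKey().matches(REGEXP);
            System.out.println((actual == e.getValue() ? "ok   " : "FAIL ") + "'" + e.getKey() + "'");
        }
    }
}
```

The same table could be moved into a unit test framework of your choice; the plain loop is used here only to keep the sketch self-contained.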
One thing I don't get though is why one would use the characters of American keyboards as criteria for validation.
Upvotes: 1
Reputation: 30985
You don't have to escape ^ since you are using a character class. Just use:
^[a-zA-Z0-9!~`@#$%^]
The character class delimited by [...] lets you list the characters you want, and most special characters are no longer special within square brackets. Exceptions include \, ], a ^ directly after the opening bracket, and a - between two characters. The main case where you need an escape is a shorthand class such as \d or \w: since the backslash is also special in Java string literals, you have to write it as \\d or \\w (but only because of Java, not the regex engine).
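The meaning of ^ inside brackets depends on its position, which the following minimal sketch illustrates. The class name `CaretPositionDemo` and its methods are hypothetical, and the tiny classes [a^], [^a] and [\d^] are chosen only to keep the example short.

```java
// Hypothetical demo class; the names are illustrative only.
class CaretPositionDemo {
    // ^ is not first in the class, so it is an ordinary member: matches "a" or "^".
    static boolean inClassWithLiteralCaret(String s) {
        return s.matches("[a^]");
    }

    // ^ directly after [ negates the class: matches any single character except "a".
    static boolean inNegatedClass(String s) {
        return s.matches("[^a]");
    }

    // \d needs a doubled backslash in Java source; the regex engine sees [\d^].
    static boolean digitOrCaret(String s) {
        return s.matches("[\\d^]");
    }

    public static void main(String[] args) {
        System.out.println(inClassWithLiteralCaret("^")); // true
        System.out.println(inClassWithLiteralCaret("b")); // false
        System.out.println(inNegatedClass("a"));          // false
        System.out.println(inNegatedClass("b"));          // true
        System.out.println(digitOrCaret("7"));            // true
        System.out.println(digitOrCaret("^"));            // true
    }
}
```

Keeping the caret away from the first position, as in the pattern above, is what allows it to stay unescaped.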
For example:
"a".matches("^[a-zA-Z0-9!~`@#$%^]");
"asdf".matches("^[a-zA-Z0-9!~`@#$%^]+"); // for multiple characters
Upvotes: 7