Reputation: 3482
I am writing regexes for the IRC protocol's ABNF message format. The following is a short example of some of the regexes I am writing.
// digit = %x30-39 ; 0-9
// "[0-9]"
static const std::string digit("[\x30-\x39]");
I use previous definitions to form more complex ones, and this gets very complex, fast. Where I run into problems, especially with the more complicated regexes, is composing them:
// hexdigit = digit / "A" / "B" / "C" / "D" / "E" / "F"
// "[[0-9]ABCDEF]"
static const std::string hexdigit("[" + digit + "ABCDEF]");
A "hexdigit" is a "digit" or "hex-letter".
Note: I know the RFC defines the "hexdigit" letters (ABCDEF) as uppercase only. I am following the RFC as written and don't plan on changing its requirements.
const std::regex digit(dapps::regex::digit);
assert(std::regex_match("0", digit));
assert(std::regex_match("1", digit));
assert(std::regex_match("2", digit));
assert(std::regex_match("3", digit));
assert(std::regex_match("4", digit));
assert(std::regex_match("5", digit));
assert(std::regex_match("6", digit));
assert(std::regex_match("7", digit));
assert(std::regex_match("8", digit));
assert(std::regex_match("9", digit));
assert(!std::regex_match("10", digit));
In the code above, matching a "digit" works as intended in the ABNF.
However, "hexdigit" is now illegal regex syntax:
[[0-9]ABCDEF]
Rather than
[0-9ABCDEF]
and trying to match with it won't work:
const std::regex hexdigit(dapps::regex::hexdigit);
assert(std::regex_match("0", hexdigit));
assert(std::regex_match("1", hexdigit));
assert(std::regex_match("2", hexdigit));
assert(std::regex_match("3", hexdigit));
assert(std::regex_match("4", hexdigit));
assert(std::regex_match("5", hexdigit));
assert(std::regex_match("6", hexdigit));
assert(std::regex_match("7", hexdigit));
assert(std::regex_match("8", hexdigit));
assert(std::regex_match("9", hexdigit));
assert(std::regex_match("A", hexdigit));
assert(std::regex_match("B", hexdigit));
assert(std::regex_match("C", hexdigit));
assert(std::regex_match("D", hexdigit));
assert(std::regex_match("E", hexdigit));
assert(std::regex_match("F", hexdigit));
assert(!std::regex_match("10", hexdigit));
Conversely, if I define "digit" without the "single character in range" selector ([ ]), then I can't use "digit" on its own to match a digit.
I may just be going about this the wrong way entirely, so my question is: do I really need to keep both versions, one with and one without brackets, or is there an easier way altogether to compose regexes?
Upvotes: 2
Views: 1318
Reputation: 12708
To get to the general format of an IRC message (no v3, as I see you don't consider tagged messages from v3) you can use this simple regexp:
^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?
It lets you dissect the message into its parts, matching up to six parameters plus the catch-all final one, which is preceded by ":".
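As a rough illustration of how that pattern could be used from C++, here is a minimal sketch built on std::regex. The struct IrcParts and the function split_irc_line are hypothetical names invented for this example, and only four of the nine capture groups are copied out. Every piece of the pattern is optional, so it matches any line, and groups that did not take part simply come back as empty strings.

```cpp
#include <regex>
#include <string>

// Hypothetical holder for a few of the captured pieces (not from any IRC library).
struct IrcParts {
    std::string prefix;       // group 1, trailing space included
    std::string command;      // group 2
    std::string first_param;  // group 3, leading space included
    std::string trailing;     // group 9, leading ':' included
};

// Applies the pattern from the answer and copies out selected groups.
// Unmatched optional groups yield empty strings.
inline IrcParts split_irc_line(const std::string& line) {
    static const std::regex message(
        R"(^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?)");
    std::smatch m;
    std::regex_search(line, m, message);
    IrcParts parts;
    parts.prefix = m[1].str();
    parts.command = m[2].str();
    parts.first_param = m[3].str();
    parts.trailing = m[9].str();
    return parts;
}
```

Note that the captured parameters keep their leading space and the trailing part keeps its leading ":", so a caller would probably want to strip those before using the values.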
Upvotes: -1
Reputation: 40056
I am not sure if I read your question right. If your concern is the "duplicated patterns" constants, you can do it by:
static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");
static const std::string digit_range("[" + digit + "]");
static const std::string hexdigit_range("[" + hexdigit + "]");
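or just keep the first two, and have a utility function like this (a sketch, assuming C++17 for the fold expression):
static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");
// Concatenates the given class bodies and wraps them in one pair of brackets.
template <typename... Ranges>
std::string range_of(const Ranges&... ranges) {
    std::string result = "[";
    ((result += ranges), ...);  // append each body in order
    result += "]";
    return result;
}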
so that you can define different kinds of range constants and use them like std::regex pattern(range_of(hexdigit));
or even something like std::regex pattern(range_of(digit, uppercase_alphabet, normal_punctuation));
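To make the idea concrete, here is a small C++11 sketch of the same approach: the constants hold only the class body (no brackets), and the brackets are added once, at the point where a usable pattern is needed. The names digit_body, hexdigit_body, char_class, is_digit and is_hexdigit are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Class bodies: the text that goes between the brackets, kept bracket-free
// so that they can be concatenated safely.
static const std::string digit_body("0-9");
static const std::string hexdigit_body(digit_body + "ABCDEF");

// Hypothetical helper: wrap a class body in a single pair of brackets.
inline std::string char_class(const std::string& body) {
    return "[" + body + "]";
}

// True if s is exactly one decimal digit.
inline bool is_digit(const std::string& s) {
    static const std::regex re(char_class(digit_body));
    return std::regex_match(s, re);
}

// True if s is exactly one uppercase hexadecimal digit.
inline bool is_hexdigit(const std::string& s) {
    static const std::regex re(char_class(hexdigit_body));
    return std::regex_match(s, re);
}
```

The composed pattern here is [0-9ABCDEF], which is the form the question was aiming for.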
Upvotes: 1
Reputation: 425308
Rather than meld the two character classes as you have attempted, which should have been:
[0-9ABCDEF]
construct an alternation, i.e. a logical OR, via the pipe character "|", and wrap the joined terms in a non-capturing group:
(?:[0-9]|[ABCDEF])
The benefit of this approach is you can join any two expressions this way, character class or otherwise, eg a digit or a whitespace:
(?:[0-9]|\s)
so it can be very generally applied.
Minor point: You can write [ABCDEF] as [A-F], and/or make it case-insensitive with [A-Fa-f].
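A minimal C++11 sketch of this alternation approach might look like the following. The helper names either and matches are invented for the example. Each definition stays a complete, standalone pattern, so the same constant can be used both for matching and for composing larger patterns.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: join two complete sub-patterns with '|' inside a
// non-capturing group, so the result can itself be composed further.
inline std::string either(const std::string& a, const std::string& b) {
    return "(?:" + a + "|" + b + ")";
}

// Each definition is a full pattern and can be used on its own.
static const std::string digit("[0-9]");
static const std::string hexdigit(either(digit, "[A-F]"));

// True if the whole of s matches the pattern.
inline bool matches(const std::string& s, const std::string& pattern) {
    return std::regex_match(s, std::regex(pattern));
}
```

With these definitions, hexdigit expands to (?:[0-9]|[A-F]), and either(digit, "\\s") gives the digit-or-whitespace pattern mentioned above.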
Upvotes: 2