Francisco Aguilera

Reputation: 3482

How to compose regexes in code

I am writing regexes for the IRC protocol's ABNF message format. The following is a short example of what I have so far.

// digit      =  %x30-39                 ; 0-9
// "[0-9]"
static const std::string digit("[\x30-\x39]");

I build more complex definitions out of earlier ones, and this gets complicated fast. Where I run into problems, especially with the more complicated regexes, is in composing them:

// hexdigit = digit / "A" / "B" / "C" / "D" / "E" / "F"
// "[[0-9]ABCDEF]"
static const std::string hexdigit("[" + digit + "ABCDEF]");

A "hexdigit" is a "digit" or "hex-letter".

Note: I don't care that the RFC defines a "hexdigit" letter (ABCDEF) as only being uppercase. I am just going with what the RFC says and I don't plan on changing their requirements.

const std::regex digit(dapps::regex::digit);
assert(std::regex_match("0", digit));
assert(std::regex_match("1", digit));
assert(std::regex_match("2", digit));
assert(std::regex_match("3", digit));
assert(std::regex_match("4", digit));
assert(std::regex_match("5", digit));
assert(std::regex_match("6", digit));
assert(std::regex_match("7", digit));
assert(std::regex_match("8", digit));
assert(std::regex_match("9", digit));
assert(!std::regex_match("10", digit));

In the code above, matching a "digit" works as the ABNF intends.

However, "hexdigit" is now illegal regex syntax:

[[0-9]ABCDEF]

Rather than

[0-9ABCDEF]

and trying to match with it won't work:

const std::regex hexdigit(dapps::regex::hexdigit);
assert(std::regex_match("0", hexdigit));
assert(std::regex_match("1", hexdigit));
assert(std::regex_match("2", hexdigit));
assert(std::regex_match("3", hexdigit));
assert(std::regex_match("4", hexdigit));
assert(std::regex_match("5", hexdigit));
assert(std::regex_match("6", hexdigit));
assert(std::regex_match("7", hexdigit));
assert(std::regex_match("8", hexdigit));
assert(std::regex_match("9", hexdigit));
assert(std::regex_match("A", hexdigit));
assert(std::regex_match("B", hexdigit));
assert(std::regex_match("C", hexdigit));
assert(std::regex_match("D", hexdigit));
assert(std::regex_match("E", hexdigit));
assert(std::regex_match("F", hexdigit));
assert(!std::regex_match("10", hexdigit));

Conversely, if I define "digit" without the character-class brackets ([ ]), then "digit" can no longer be used on its own to match a digit.

I may just be going about this the wrong way entirely, so my question is: do I really need to keep both versions, one with and one without brackets, or is there an easier way altogether to compose regexes?

Upvotes: 2

Views: 1318

Answers (3)

Luis Colorado

Reputation: 12708

To get to the general format of an IRC message (no v3, as I see you don't consider tagged messages from v3) you can use this simple regexp:

^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?

It allows you to dissect the message contents into its parts, allowing up to six different parameters to be matched and the catchall final one, preceded by :.
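As a rough sketch of how this pattern might be used from C++, the snippet below feeds it to std::regex unchanged and copies out the capture groups. The function name dissect and the sample message in the note are illustrative assumptions, not part of the answer. The group numbering (1 = prefix, 2 = command, 3 to 8 = parameters, 9 = trailing) comes from counting the parentheses in the pattern.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the whole match at index 0 followed by the
// nine capture groups. Groups that did not participate are empty strings.
// An empty vector means the pattern did not match at all.
std::vector<std::string> dissect(const std::string& line) {
    // The pattern from the answer, kept verbatim in a raw string literal.
    static const std::regex message(
        R"re(^\s*(:[^ \n:]* )?([A-Za-z0-9]*)( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?( [^ \n:]*)?(:.*)?)re");

    std::vector<std::string> parts;
    std::smatch m;
    if (std::regex_search(line, m, message)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            parts.push_back(m[i].str());
        }
    }
    return parts;
}
```

For a line such as ":nick!user@host PRIVMSG #chan :hello world", index 2 holds the command PRIVMSG and index 9 holds the trailing part ":hello world". Parameters keep their leading space, so callers would probably want to trim them.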

Upvotes: -1

Adrian Shum

Reputation: 40056

I am not sure I read your question right. If your concern is duplicating the pattern constants, you could define them like this:

static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");
static const std::string digit_range("[" + digit + "]");
static const std::string hexdigit_range("[" + hexdigit + "]");

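or just keep the first two, and add a utility function like this (C++17):

#include <string>

static const std::string digit("0-9");
static const std::string hexdigit(digit + "ABCDEF");

// Concatenates the given range fragments inside a single bracket expression.
template <typename... Ranges>
std::string range_of(const Ranges&... ranges) {
    std::string result = "[";
    ((result += ranges), ...);  // append each fragment in order
    result += "]";
    return result;
}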

so that you can define different kinds of range constants and use them with std::regex pattern(range_of(hexdigit)); or even something like std::regex pattern(range_of(digit, uppercase_alphabet, normal_punctuation));
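To make the idea concrete, here is a minimal sketch of the fragment approach wired up to std::regex. The names digit_chars, hexdigit_chars, char_class, digit_re and hexdigit_re are assumptions chosen for this illustration. The point is only that the composable pieces stay bracket-free and the brackets are added once, when the regex is built.

```cpp
#include <regex>
#include <string>

// Bracket-free fragments: these are the pieces that get composed.
static const std::string digit_chars("0-9");
static const std::string hexdigit_chars(digit_chars + "ABCDEF");

// Hypothetical helper: wraps one fragment in a single bracket expression.
std::string char_class(const std::string& fragment) {
    return "[" + fragment + "]";
}

// Brackets are added only at the point where a std::regex is constructed.
static const std::regex digit_re(char_class(digit_chars));        // [0-9]
static const std::regex hexdigit_re(char_class(hexdigit_chars));  // [0-9ABCDEF]
```

With these definitions, std::regex_match("7", digit_re) and std::regex_match("A", hexdigit_re) both succeed, while "10" is rejected by both because regex_match requires the whole input to match a single character.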

Upvotes: 1

Bohemian

Reputation: 425308

Rather than meld the two character classes as you have attempted, which should have been:

[0-9ABCDEF]

construct an alternation - i.e. a logical OR - via the pipe character |, and wrap the joined terms in a non-capturing group:

(?:[0-9]|[ABCDEF])

The benefit of this approach is you can join any two expressions this way, character class or otherwise, eg a digit or a whitespace:

(?:[0-9]|\s)

so it can be very generally applied.
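In C++ this could look roughly like the sketch below. The helper name either and the constants hexletter and hexpair are assumptions made up for the illustration. Each definition is a complete, self-contained pattern, so it can be matched on its own or dropped into a larger expression without worrying about brackets.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: joins two complete sub-patterns with | inside a
// non-capturing group, so the result can itself be composed further.
std::string either(const std::string& a, const std::string& b) {
    return "(?:" + a + "|" + b + ")";
}

static const std::string digit("[0-9]");
static const std::string hexletter("[ABCDEF]");
// Expands to (?:[0-9]|[ABCDEF])
static const std::string hexdigit(either(digit, hexletter));
// Composed patterns concatenate like any other pattern: two hex digits.
static const std::string hexpair(hexdigit + hexdigit);
```

Here std::regex(digit) still matches a single digit, std::regex(hexdigit) matches "0" through "9" and "A" through "F", and std::regex(hexpair) matches strings such as "4F".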


Minor point: You can code [ABCDEF] as [A-F] and/or can make it case insensitive with [A-Fa-f].

Upvotes: 2
