Reputation: 5992
I've read a book where it states that all fundamental operations in regular expressions are concatatenation
, or(|)
, closure(*)
and parenthesis
to override default precedence. Every other operation is just a shortcut for one or more fundamental operations.
For example, (AB)+
shortcut is expanded to (AB)(AB)*
and (AB)?
to (ε | AB)
where ε
is empty string. First of all, I looked up ASCII table and I am not sure which charcode is designated to empty string. Is it ASCII 0
?
I'd like to figure out how to express the shortcuts ^
and $
as in ^AB
or AB$
expression in the fundamental operations, but I am not sure how to do this. Can you help me out how this is expressed in fundamentals?
Upvotes: 3
Views: 82
Reputation: 85767
Regular expressions, the way they are defined in mathematics, are actually string generators, not search patterns. They are used as a convenient notation for a certain class of sets of strings. (Those sets can contain an infinite number of strings, so enumerating all elements is not practical.)
In a programming context, regexes are usually used as flexible search patterns. In mathematical terms we're saying, "find a substring of the target string S that is an element of the set generated by regex R". This substring search is not part of the regex proper; it's like there's a loop around the actual regex engine that tries to match every possible substring against the regex (and stops when it finds a match).
In fundamental regex terms, it's like there's an implicit .*
added before and after your pattern. When you look at it this way, ^
and $
simply prevent .*
from being added at the beginning/end of the regex.
As an aside, regexes (as commonly used in programming) are not actually "regular" in the mathematical sense; i.e. there are many constructs that cannot be translated to the fundamental operations listed above. These include backreferences (\1
, \2
, ...), word boundaries (\b
, \<
, \>
), look-ahead/look-behind assertions ((?= )
, (?! )
, (?<= )
, (?<! )
), and others.
As for ε: It has no character code because the empty string is a string, not a character. Specifically, a string is a sequence of characters, and the empty string contains no characters.
Upvotes: 7
Reputation: 4013
^AB
can be expressed as (εAB)
ie an empty string followed by AB and AB$
can be expressed as (ABε)
that's AB followed by an empty string.
The empty string is actually defined as ''
, that's a string of 0 length, so has no value in the ASCII table. However the C programming language terminates all strings with the ASCII NULL character, although this is not counted in the length of the string it still must be accounted for when allocating memory.
EDIT As @melpomene pointed out in their comment εAB is equivalent to AB which makes the above invalid. Having talked to a work college I'm no longer sure how to do this or even if it's possible. Hopefully someone can come up with an answer.
Upvotes: 2