Reputation: 25812
We define a regex type like this:
type regex_t =
| Empty_String
| Char of char
| Union of regex_t * regex_t
| Concat of regex_t * regex_t
| Star of regex_t
We want to write a function string_to_regex: string -> regex_t
.
Empty_string
is 'E'Char
are 'a'..'z'Union
Star
Concat
is assumed for continuous parsing.For example,
(a|E)*(a|b)
will be
Concat(Star(Union(Char 'a',Empty_String)),Union(Char 'a',Char 'b'))
How to implement string_to_regex
?
Upvotes: 2
Views: 1921
Reputation: 101
The comment by @Thomas is very complete, but actually the precedence of the operators is not correct: parsing a|aa
will result in Concat(Choice(Char a, Char a), Char a)
, namely equals (a|a)a
. The regular expression operators precedence requires that a|aa = a|(aa)
, which should then result in Choice(Char a, Concat(Char a, Char a))
. The problem is that the CONCAT
token is a hack, and even if you specify %left CHOICE
before %left CONCAT
in the parser.mly
file, this doesn't imply that the precedence will be respected. One possible solution is to perform recursive descending parsing, effectively making the grammar unambiguous. If you want to use this approach you can modify parser.mly
with:
%{ open Ast %}
%token <char> CHAR
%token EMPTY
%token STAR
%token CHOICE
%token LPAR
%token RPAR
%token EOF
%start main
%type <Ast.regex_t> main
%%
main: r = regex EOF { r }
regex:
| r = disjunction { r }
disjunction:
| a = disjunction CHOICE b = concat { Choice(a, b) }
| r = concat {r}
concat:
| a = concat b = repetition { Concat(a, b) }
| r = repetition {r}
repetition:
| r = repetition STAR { Star r }
| r = atom { r }
atom:
| LPAR r = regex RPAR { r }
| c = CHAR { Char c }
| EMPTY { Empty }
This leads to no ambiguity (= no need to specify operators associativity and precedence) and will produce the correct result.
Upvotes: 0
Reputation: 6379
Ocamllex and menhir are wonderful tools to write lexers and parsers
ast.mli
type regex_t =
| Empty
| Char of char
| Concat of regex_t * regex_t
| Choice of regex_t * regex_t
| Star of regex_t
lexer.mll
{ open Parser }
rule token = parse
| ['a'-'z'] as c { CHAR c }
| 'E' { EMPTY }
| '*' { STAR }
| '|' { CHOICE }
| '(' { LPAR }
| ')' { RPAR }
| eof { EOF }
parser.mly
%{ open Ast %}
%token <char> CHAR
%token EMPTY STAR CHOICE LPAR RPAR CONCAT
%token EOF
%nonassoc LPAR EMPTY CHAR
%left CHOICE
%left STAR
%left CONCAT
%start main
%type <Ast.regex_t> main
%%
main: r = regex EOF { r }
regex:
| EMPTY { Empty }
| c = CHAR { Char c }
| LPAR r = regex RPAR { r }
| a = regex CHOICE b = regex { Choice(a, b) }
| r = regex STAR { Star r }
| a = regex b = regex { Concat(a, b) } %prec CONCAT
main.ml
open Ast
let rec format_regex = function
| Empty -> "Empty"
| Char c -> "Char " ^ String.make 1 c
| Concat(a, b) -> "Concat("^format_regex a^", "^format_regex b^")"
| Choice(a, b) -> "Choice("^format_regex a^", "^format_regex b^")"
| Star(a) -> "Star("^format_regex a^")"
let () =
let s = read_line () in
let r = Parser.main Lexer.token (Lexing.from_string s) in
print_endline (format_regex r)
and to compile
ocamllex lexer.mll
menhir parser.mly
ocamlc -c ast.mli
ocamlc -c parser.mli
ocamlc -c parser.ml
ocamlc -c lexer.ml
ocamlc -c main.ml
ocamlc -o regex parser.cmo lexer.cmo main.cmo
and then
$ ./regex
(a|E)*(a|b)
Concat(Star(Choice(Char a, Empty)), Choice(Char a, Char b))
Upvotes: 6