Jackson Tale
Jackson Tale

Reputation: 25812

How to parse a string to a regex type in OCaml

We define a regex type like this:

type regex_t =
    | Empty_String
    | Char of char
    | Union of regex_t * regex_t 
    | Concat of regex_t * regex_t 
    | Star of regex_t 

We want to write a function string_to_regex: string -> regex_t.

For example,

(a|E)*(a|b) will be

Concat(Star(Union(Char 'a',Empty_String)),Union(Char 'a',Char 'b'))

How to implement string_to_regex?

Upvotes: 2

Views: 1921

Answers (2)

greps
greps

Reputation: 101

The comment by @Thomas is very complete, but actually the precedence of the operators is not correct: parsing a|aa will result in Concat(Choice(Char a, Char a), Char a), namely equals (a|a)a. The regular expression operators precedence requires that a|aa = a|(aa), which should then result in Choice(Char a, Concat(Char a, Char a)). The problem is that the CONCAT token is a hack, and even if you specify %left CHOICE before %left CONCAT in the parser.mly file, this doesn't imply that the precedence will be respected. One possible solution is to perform recursive descending parsing, effectively making the grammar unambiguous. If you want to use this approach you can modify parser.mly with:

%{ open Ast %}

%token <char> CHAR
%token EMPTY
%token STAR
%token CHOICE
%token LPAR
%token RPAR
%token EOF

%start main
%type <Ast.regex_t> main

%%

main: r = regex EOF { r }

regex:
| r = disjunction { r }

disjunction:
| a = disjunction CHOICE b = concat { Choice(a, b) }
| r = concat {r}

concat:
| a = concat b = repetition { Concat(a, b) } 
| r = repetition {r}

repetition:
| r = repetition STAR { Star r }
| r = atom { r }

atom:
| LPAR r = regex RPAR { r }
| c = CHAR { Char c }
| EMPTY { Empty }

This leads to no ambiguity (= no need to specify operators associativity and precedence) and will produce the correct result.

Upvotes: 0

Thomash
Thomash

Reputation: 6379

Ocamllex and menhir are wonderful tools to write lexers and parsers

ast.mli

type regex_t =
| Empty
| Char of char
| Concat of regex_t * regex_t
| Choice of regex_t * regex_t
| Star of regex_t

lexer.mll

{ open Parser }

rule token = parse
| ['a'-'z'] as c { CHAR c }
| 'E' { EMPTY }
| '*' { STAR }
| '|' { CHOICE }
| '(' { LPAR }
| ')' { RPAR }
| eof { EOF }

parser.mly

%{ open Ast %}

%token <char> CHAR
%token EMPTY STAR CHOICE LPAR RPAR CONCAT
%token EOF

%nonassoc LPAR EMPTY CHAR

%left CHOICE
%left STAR
%left CONCAT

%start main
%type <Ast.regex_t> main

%%

main: r = regex EOF { r }

regex:
| EMPTY { Empty }
| c = CHAR { Char c }
| LPAR r = regex RPAR { r }
| a = regex CHOICE b = regex { Choice(a, b) }
| r = regex STAR { Star r }
| a = regex b = regex { Concat(a, b) } %prec CONCAT

main.ml

open Ast

let rec format_regex = function
| Empty -> "Empty"
| Char c -> "Char " ^ String.make 1 c
| Concat(a, b) -> "Concat("^format_regex a^", "^format_regex b^")"
| Choice(a, b) -> "Choice("^format_regex a^", "^format_regex b^")"
| Star(a) -> "Star("^format_regex a^")"

let () =
  let s = read_line () in
  let r = Parser.main Lexer.token (Lexing.from_string s) in
  print_endline (format_regex r)

and to compile

ocamllex lexer.mll
menhir parser.mly
ocamlc -c ast.mli
ocamlc -c parser.mli
ocamlc -c parser.ml
ocamlc -c lexer.ml
ocamlc -c main.ml
ocamlc -o regex parser.cmo lexer.cmo main.cmo

and then

$ ./regex
(a|E)*(a|b)
Concat(Star(Choice(Char a, Empty)), Choice(Char a, Char b))

Upvotes: 6

Related Questions