Benji Cooper

Reputation: 59

OCaml regex being buggy when trying to use escape characters

I'm trying to write a lexer for a variation on C using OCaml. For the lexer I need to match the strings "^" and "||" (as the exponentiation and or symbols respectively). Both of these are special characters in regex, and when I try to escape them using the backslash, nothing changes and the code runs as if "\^" was still beginning of line and "\|\|" was still "or or". What can I do to fix this?

Upvotes: 2

Views: 1995

Answers (1)

dkim

Reputation: 3970

Backslash characters in string literals have to be doubled to make it past the OCaml string parser, so that a single backslash actually reaches the regex engine:

# let r = Str.regexp "\\^" in
    Str.search_forward r "FOO^BAR" 0;;
- : int = 3

If you are using OCaml 4.02 or later, you can also use quoted strings ({| ... |}), which do not handle a backslash character specially. This may result in more readable code because backslash characters do not have to be doubled:

# let r = Str.regexp {|\^|} in
    Str.search_forward r "FOO^BAR" 0;;
- : int = 3
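
To see that the two spellings above are the same pattern, it can help to inspect the strings themselves. This is a minimal sketch that does not need Str; the names escaped and quoted are only illustrative:

```ocaml
(* Both literals denote the same two-character string:
   a backslash followed by a caret. *)
let escaped = "\\^"   (* ordinary literal, backslash doubled *)
let quoted = {|\^|}   (* quoted string, OCaml 4.02 or later *)

let () =
  Printf.printf "length = %d, equal = %b\n"
    (String.length escaped) (escaped = quoted)
```

Either value can be passed to Str.regexp, which then sees the backslash and treats ^ as a literal character.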

Or you may consider using Str.regexp_string (or Str.quote), which creates a regular expression that will match all characters in its argument literally:

# let r = Str.regexp_string "^" in
    Str.search_forward r "FOO^BAR" 0;;
- : int = 3
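
Str.quote is mostly useful when a literal piece has to be combined with a real pattern. As a rough sketch (the pattern and the helper exponent_at are made up for illustration, and the program has to be linked against the str library), this matches a caret followed by one or more digits:

```ocaml
(* Str.quote "^" yields a pattern that matches the caret literally;
   "[0-9]+" is ordinary Str syntax for one or more digits. *)
let caret_number = Str.regexp (Str.quote "^" ^ "[0-9]+")

(* Hypothetical helper: return the matched text if the pattern matches
   exactly at position pos in s. *)
let exponent_at s pos =
  if Str.string_match caret_number s pos
  then Some (Str.matched_string s)
  else None
```

With these definitions, exponent_at "x^12" 1 should give Some "^12", while exponent_at "x^12" 0 should give None because Str.string_match only tries the given position.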

The Str module does not take | as a special regex character, so you do not have to worry about quoting when you want to use it literally:

# let r = Str.regexp "||" in
    Str.search_forward r "FOO||BAR" 0;;
- : int = 3

| has to be preceded by a backslash only when you want it to act as the "or" (alternation) construct, which is the opposite of most other regex dialects:

# let r = Str.regexp "BAZ\\|BAR" in
    Str.search_forward r "FOOBAR" 0;;
- : int = 3
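
Putting these pieces together for the lexer in the question, one possible approach is to quote each operator with Str.quote and join the results with the alternation construct. This is only a sketch: the operator list and the helper find_operator are assumptions for illustration, not part of Str, and the program has to be linked against the str library:

```ocaml
(* Operators to recognise, written as plain strings. *)
let operators = ["||"; "^"]

(* Quote each operator so it matches literally, then join the pieces
   with "\\|", which Str reads as alternation. *)
let operator_re =
  operators
  |> List.map Str.quote
  |> String.concat "\\|"
  |> Str.regexp

(* Hypothetical helper: position and text of the first operator found
   at or after start, or None if there is none. *)
let find_operator s start =
  match Str.search_forward operator_re s start with
  | pos -> Some (pos, Str.matched_string s)
  | exception Not_found -> None
```

For example, find_operator "a||b" 0 should return Some (1, "||") and find_operator "a^b" 0 should return Some (1, "^"). Building the pattern from plain strings keeps the escaping in one place, so new operators can be added without thinking about which characters Str treats as special.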

For the full syntax of regular expressions accepted by the Str module, see the documentation of Str.regexp.

Upvotes: 7
