Reputation: 2817
I'm trying to implement a lexer in OCaml. As a first step, I need to take a string and split it into a list of strings, so that I can later "tokenize" them and feed them to a parser. It should ignore whitespace (spaces, tabs, newlines, etc.). For example:
"1 + 25 *(6^2)"
should return
["1"; "+"; "25"; "*"; "("; "6"; "^"; "2"; ")"]
If the beginning of a string could be multiple things, the longest match should be preferred, for example:
"1-1" should be split as ["1"; "-1"] since "-1" is a longer match than just "-"
I'm trying to do this first step with Str.regexp, but it's not powerful enough to split the input perfectly. My code:
Str.split (Str.regexp "[ \t\n]+") input
takes input and splits it on [ \t\n]+, so the issue is that if I have something like (5 + 6^8), it returns ["(5"; "+"; "6^8)"] instead of ["("; "5"; "+"; "6"; "^"; "8"; ")"].
Any idea how I could do this better?
Upvotes: 0
Views: 624
Reputation: 66823
This is what ocamllex is for. You'll need an explicit list of the lexical structures you want to recognize, rather than just splitting on whitespace.
As a side comment, be sure to read the section of the ocamllex documentation that describes which regular expression constructs are supported. A common problem is trying to use constructs from other languages that ocamllex doesn't support.
For what it's worth, it is tricky to handle negative numbers at the lexical level, because you usually want to support things like "x-1". If you try to handle negative numbers lexically, this comes out as two tokens ("x" and "-1") rather than as "x", "-", "1".
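To make the "x-1" problem concrete, here is a small sketch using the Str module (it must be linked, e.g. with `-package str` under ocamlfind). The names `signed_int` and `match_at` are made up for this illustration; the pattern is just one plausible way to write "integer with optional minus sign".

```ocaml
(* Requires the Str library. *)

(* Hypothetical pattern: an integer literal with an optional leading minus. *)
let signed_int = Str.regexp "-?[0-9]+"

(* Text matched by [re] starting exactly at offset [pos] in [s], if any. *)
let match_at re s pos =
  if Str.string_match re s pos then Some (Str.matched_string s) else None

(* At offset 1 of "x-1" the signed-integer rule swallows the minus sign,
   so no separate "-" operator token is ever produced. *)
let after_x = match_at signed_int "x-1" 1
```

With a rule like this, the parser sees an identifier followed directly by a number, which is why the sign is often left to the parser instead.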
Update
If you can't use ocamllex, you still need to think in terms of a set of regular expressions.
If you can use the Str module, you can use Str.regexp to create the same set of regular expressions you would have used with ocamllex. To get the next token, match all of the regular expressions and take the longest match. (To break ties on length, order the regular expressions and take the first match of the longest length.)
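A minimal sketch of that longest-match loop, assuming the Str library is linked. The pattern list and the names `token_patterns`, `match_len`, and `tokenize` are my own choices for illustration, and the signed-integer rule is there only to reproduce the "1-1" example from the question.

```ocaml
(* Requires the Str library (e.g. ocamlfind ocamlopt -package str -linkpkg). *)

(* Assumed token patterns, in priority order. *)
let token_patterns =
  List.map Str.regexp [
    "-?[0-9]+";    (* integer literal with optional sign, as in the question *)
    "[-+*/^()]";   (* single-character operators and parentheses *)
  ]

let whitespace = Str.regexp "[ \t\n]+"

(* Length of the match of [re] starting exactly at [pos], or 0 if none. *)
let match_len re s pos =
  if Str.string_match re s pos then Str.match_end () - pos else 0

let tokenize s =
  let n = String.length s in
  let rec go pos acc =
    if pos >= n then List.rev acc
    else
      let ws = match_len whitespace s pos in
      if ws > 0 then go (pos + ws) acc   (* skip whitespace between tokens *)
      else
        (* Try every pattern and keep the longest length. The strict [>]
           means an earlier pattern is not displaced by a later one of
           equal length. *)
        let best =
          List.fold_left
            (fun best re ->
               let len = match_len re s pos in
               if len > best then len else best)
            0 token_patterns
        in
        if best = 0 then
          failwith
            (Printf.sprintf "unexpected character %C at offset %d" s.[pos] pos)
        else go (pos + best) (String.sub s pos best :: acc)
  in
  go 0 []
```

For example, `tokenize "1 + 25 *(6^2)"` gives `["1"; "+"; "25"; "*"; "("; "6"; "^"; "2"; ")"]`, and `tokenize "1-1"` gives `["1"; "-1"]`. This version returns only the matched text; a real lexer would probably also record which pattern won so it can build typed tokens.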
Upvotes: 1