Reputation: 9140
I have a simple language which consists of patterns like
size(50*50)
start(10, 20, -x)
forward(15)
stop
It's an example of a turtle-drawing language, and I need to tokenize it properly. The above is a sample of the source; statements and expressions are separated by newlines. I set up my Scanner to use delimiters like newlines. I expect next("start") to eat the string "start", and then I issue next("(") to eat the first parenthesis. However, it appears to do something other than what I expect. Has the Scanner already broken the above into tokens based on the delimiter, and/or do I need to approach this differently? To my mind, "size", "(", "50", "*", "50" and ")" on the first line should constitute separate tokens, but that expectation is not being met here. How can I tokenize the above with as little code as possible? I'm writing an interpreter, not a tokenizer, so tokenizing is something I don't want to spend my time on right now; I'd just like Scanner to work with me here.
My useDelimiter call is as follows:
Scanner s ///...
s.useDelimiter(Pattern.compile("[\\s]&&[^\\r\\n]"));
Issuing the first next call gives me the entire file contents. Without the above call, it gives me the entire first line.
Upvotes: 1
Views: 1226
Reputation: 205785
The class java.io.StreamTokenizer may be a better fit. It is used in this example of a recursive descent parser.
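As a rough sketch, assuming the input from the question is available as a String (the class name TurtleTokens and the inline source string are just for illustration), StreamTokenizer might be used like this:

import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TurtleTokens {
    public static void main(String[] args) throws IOException {
        String src = "size(50*50)\nstart(10, 20, -x)\nforward(15)\nstop\n";
        StreamTokenizer st = new StreamTokenizer(new StringReader(src));
        st.eolIsSignificant(true);   // report line ends, since they separate statements
        st.ordinaryChar('-');        // treat '-' as a token of its own, not a number sign
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:   System.out.println("WORD   " + st.sval); break;
                case StreamTokenizer.TT_NUMBER: System.out.println("NUMBER " + st.nval); break;
                case StreamTokenizer.TT_EOL:    System.out.println("EOL");               break;
                default:                        System.out.println("CHAR   " + (char) st.ttype);
            }
        }
    }
}

Each call to nextToken() advances one token; ttype then tells you whether it was a word, a number, a line end, or a single ordinary character such as ( or *.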
Addendum: What is the principal difference between the StreamTokenizer and Scanner classes?
Either can do the lexical analysis required by a parser. StreamTokenizer is lighter weight but limited to four pre-defined meta-tokens. Scanner is considerably more flexible, but somewhat more cumbersome to use. Here's a comparison of the two and a variation on the latter.
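For a feel of the Scanner side, here is one possible variation using findWithinHorizon, assuming a single alternation pattern that covers the token set from the question (the class name ScannerLexer is just for illustration):

import java.util.Scanner;
import java.util.regex.Pattern;

public class ScannerLexer {
    // One alternation covering the tokens in the question: words, integers, punctuation.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|\\d+|[()*,+-]");

    public static void main(String[] args) {
        Scanner s = new Scanner("size(50*50)\nstart(10, 20, -x)\nforward(15)\nstop\n");
        String token;
        // findWithinHorizon ignores the delimiters entirely and advances past each match,
        // skipping any intervening whitespace and newlines (horizon 0 means unbounded).
        while ((token = s.findWithinHorizon(TOKEN, 0)) != null) {
            System.out.println(token);
        }
    }
}

Note that this sketch silently skips the newlines; if line ends are significant to your grammar, add \r?\n as another alternative in the pattern.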
Upvotes: 2
Reputation: 47609
To write a proper parser, you need to define your language in a formal grammar. Trust me, you want to do it properly or you will have problems downstream.
You can probably represent your tokens as regular expressions at the lowest level, but first you need to be clear about your grammar, which describes how tokens combine into larger structures. You can represent the grammar as recursive functions (methods), known as productions. Each production function can use the scanner to test whether it is looking at a token it wants, but Scanner consumes its input, so you can't rewind.
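For a concrete feel, here's a sketch of one production for the question's language over a pre-tokenized list; every name here is illustrative, not an established API:

import java.util.List;

public class StatementParser {
    private final List<String> tokens;
    private int pos = 0;

    public StatementParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private void expect(String t) {
        if (!t.equals(peek()))
            throw new IllegalStateException("expected " + t + " but saw " + peek());
        pos++;
    }

    // statement := "stop" | WORD "(" arg ("," arg)* ")"
    void statement() {
        if ("stop".equals(peek())) { pos++; return; }
        pos++;                           // the command word, e.g. "start" or "forward"
        expect("(");
        arg();
        while (",".equals(peek())) { pos++; arg(); }
        expect(")");
    }

    // arg := ["-"] (NUMBER | WORD) -- simplified for the sketch
    void arg() {
        if ("-".equals(peek())) pos++;   // optional sign
        pos++;                           // the number or word itself
    }

    public static void main(String[] args) {
        new StatementParser(List.of("start", "(", "10", ",", "20", ",", "-", "x", ")")).statement();
        System.out.println("parsed OK");
    }
}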
If you use Scanner, you will find it unsuitable in the following ways:
It will always parse a token according to the regular expression:
1.1 so even if you do get a token you can use, you will have to write more code to decide exactly which token it was,
1.2 and you may not be able to represent your language's grammar as one big expression.
I suggest you write the character-level lexer yourself and iterate over a string / array of chars rather than a stream. Then you can rewind, as in the sketch below.
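A rough sketch of that idea, assuming the token set from the question (the class name TinyLexer and its methods are hypothetical):

public class TinyLexer {
    private final String src;
    private int pos = 0;

    public TinyLexer(String src) { this.src = src; }

    public int mark() { return pos; }           // save the position...
    public void reset(int mark) { pos = mark; } // ...and rewind to it later

    // Returns the next token, or null at end of input; newlines are tokens themselves.
    public String next() {
        while (pos < src.length()
                && (src.charAt(pos) == ' ' || src.charAt(pos) == '\t' || src.charAt(pos) == '\r')) {
            pos++;                               // skip horizontal whitespace only
        }
        if (pos >= src.length()) return null;
        char c = src.charAt(pos);
        if (c == '\n') { pos++; return "\n"; }   // statement separator
        if (Character.isLetter(c)) {             // a word: size, start, forward, stop, x
            int start = pos;
            while (pos < src.length() && Character.isLetter(src.charAt(pos))) pos++;
            return src.substring(start, pos);
        }
        if (Character.isDigit(c)) {              // a number: 50, 10, 15
            int start = pos;
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            return src.substring(start, pos);
        }
        pos++;                                   // a single-character token: ( ) * , -
        return String.valueOf(c);
    }

    public static void main(String[] args) {
        TinyLexer lx = new TinyLexer("size(50*50)\nstart(10, 20, -x)\nforward(15)\nstop");
        for (String t = lx.next(); t != null; t = lx.next()) {
            System.out.println("\n".equals(t) ? "NEWLINE" : t);
        }
    }
}

Because the only lexer state is the integer position, backtracking in a production is just a mark() before the attempt and a reset(...) if it fails.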
Otherwise, use a ready-built lexer/parser framework like yacc or Coco/R.
Upvotes: 3