Reputation: 8081
Prologue: Although the set of languages recognized by parsers (context-free grammars) is strictly larger than that of scanners (regular grammars), most parser generators still need a scanner.
(Please don't try to explain the reasons behind it; I know them quite well.)
I've seen parsers that do not require a scanner, like
There are certain advantages to using no scanner:
Often, "workarounds" are used like switching the scanner on the parser's request.
Question: Do you know any other scannerless parser generators (any language)? Are these practical in use (or purely academic)? Are there any other approaches than Tomita/GLR?
Answers:
Upvotes: 27
Views: 4119
Reputation: 187
I have written a 'scan-on-demand' parser. It is a kind of compromise between a parser with a scanner and a scanner-less parser.
It allows for 'no reserved keywords', and enables one language to be embedded/nested in another.
A good example of this nesting is the Dot grammar from graphviz https://graphviz.gitlab.io/, where XML/HTML can be embedded in the outer graph language.
You can see a demo of my parser and the Dot grammar at https://info.itemis.com/demo/agl/editor
and more details on the parser can be found here https://medium.com/@dr.david.h.akehurst/a-kotlin-multi-platform-parser-usable-from-a-jvm-or-javascript-59e870832a79
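For illustration only, here is a generic sketch of the 'scan on demand' idea (not the AGL parser's actual code): instead of tokenizing the whole input up front, the parser asks for one specific terminal at one specific input position, which is why no keyword needs to be reserved globally:

    // Hypothetical sketch of "scan on demand" (all names are invented).
    // "if" only acts as a keyword at positions where the grammar
    // currently expects one; elsewhere it can be an ordinary identifier.
    #include <iostream>
    #include <optional>
    #include <string>
    #include <string_view>

    struct Match { std::size_t end; std::string text; };

    // Try to match exactly the terminal the parser wants at `pos`.
    std::optional<Match> scanOnDemand(std::string_view input,
                                      std::size_t pos,
                                      std::string_view terminal) {
        if (input.substr(pos, terminal.size()) == terminal)
            return Match{pos + terminal.size(), std::string(terminal)};
        return std::nullopt;
    }

    int main() {
        std::string_view src = "if x then y";
        // The parser, mid-derivation, asks: is there an "if" at offset 0?
        if (auto m = scanOnDemand(src, 0, "if"))
            std::cout << "matched \"" << m->text << "\" ending at offset "
                      << m->end << "\n";
    }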
Upvotes: 0
Reputation: 16614
Waxeye: Scannerless parser based on parsing expression grammars (PEGs).
Upvotes: 3
Reputation: 9724
Sorry for the necropost. You can try this out - an embeddable PEG/Packrat implementation for .NET:
http://www.meta-alternative.net/pfdoc.pdf
http://www.meta-alternative.net/mbase.html
Upvotes: 1
Reputation: 202735
Two others:
Bryan Ford's Parsing Expression Grammars (PEG) require no scanner. An efficient, lazy "packrat parser" is optional. I've had nothing but good experience with the Lua LPEG version, which compiles to an efficient bytecode machine. Quite practical.
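To make "packrat" concrete, here is a toy sketch of the idea (illustrative only, not how LPEG works internally): memoize each (rule, position) result so that backtracking never re-parses the same substring, which keeps PEG parsing linear in the input size:

    // Toy packrat parser for: Expr <- Num ('+' Expr)? ; Num <- [0-9]+
    #include <cctype>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    using Result = std::optional<std::size_t>;           // position after a match
    std::map<std::pair<int, std::size_t>, Result> memo;  // (rule, pos) -> result
    std::string in;

    Result num(std::size_t p) {                          // Num <- [0-9]+
        if (auto it = memo.find({0, p}); it != memo.end()) return it->second;
        std::size_t q = p;
        while (q < in.size() && std::isdigit((unsigned char)in[q])) ++q;
        return memo[{0, p}] = (q > p) ? Result{q} : std::nullopt;
    }

    Result expr(std::size_t p) {                         // Expr <- Num ('+' Expr)?
        if (auto it = memo.find({1, p}); it != memo.end()) return it->second;
        Result r = num(p);
        if (r && *r < in.size() && in[*r] == '+')
            if (Result rest = expr(*r + 1)) r = rest;    // '?' falls back on failure
        return memo[{1, p}] = r;
    }

    int main() {
        in = "1+22+333";
        Result r = expr(0);
        std::cout << (r && *r == in.size() ? "parsed" : "failed") << "\n";
    }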
YAKKER looks very intriguing although it is still clearly in a pre-release state. They are using what they claim to be an efficient variation on Earley's parsing algorithm.
I'm actually a huge fan of scannerless parsers; they simplify configuration enormously. And typical scanner generators, to put it mildly, are not much fun to use. From the man page for Lex:
Finally, I have no personal experience with Elkhound, but the secondhand reports I hear are impressive. I would say there's no question but that some scannerless parser generators are very practical.
Upvotes: 11
Reputation: 95430
Parser generators don't need a scanner. But you are pretty much crazy if you don't use one.
Parsers built by parser generators don't care what you feed them, as long as you call them tokens.
To use a parser generator without a scanner, simply define your grammar down to the character level, and feed individual characters to the parser as tokens.
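As a hypothetical sketch (not taken from any real tool), the trick looks roughly like this in a Bison grammar, where character literals serve as the tokens and yylex() degenerates to reading one character:

    /* chars_as_tokens.y : every character is its own token; 'a' in the
       grammar is simply the token whose code is the character 'a', so
       no scanner is generated or needed. */
    %{
    #include <cstdio>
    int yylex();
    void yyerror(const char* msg) { std::fprintf(stderr, "%s\n", msg); }
    %}
    %%
    identifier : letter
               | identifier letter
               ;
    letter     : 'a' | 'b' | 'c' ;   /* ...and so on through 'z' */
    %%
    int yylex() {
        int c = std::getchar();      /* one character == one "token" */
        return c == EOF ? 0 : c;     /* 0 tells Bison "end of input"  */
    }
    int main() { return yyparse(); }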
The reason this is crazy is that parsing is a more complex activity than lexing. You can build lexers as finite state machines, which translate to machine code pretty much as "compare and jump to next state". For speed, that's really hard to beat. Parser generators construct parsers that do recursive descent predictive parsing (for most LL generators such as ANTLR) or do table lookups by hashing, binary or linear search, etc. So a parser spends a lot more energy on a token than a lexer spends on a character.
If you feed characters to a parser as tokens, it will then spend correspondingly more energy on each character than the equivalent lexer would. If you process a lot of input text, this eventually matters, whether you do it for zillions of small input streams or a few really large ones.
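To put a rough shape on the "compare and jump to next state" claim, here is a hypothetical hand-rolled lexer; none of this code comes from DMS or any other tool mentioned here:

    // Each character costs roughly one classification test plus a jump
    // to the next state of a small finite state machine.
    #include <cctype>
    #include <iostream>
    #include <string_view>

    enum class Tok { Ident, Int, End, Error };

    Tok nextToken(std::string_view in, std::size_t& pos) {
        while (pos < in.size() && std::isspace((unsigned char)in[pos])) ++pos;
        if (pos == in.size()) return Tok::End;
        if (std::isalpha((unsigned char)in[pos])) {        // state: IN_IDENT
            while (pos < in.size() && std::isalnum((unsigned char)in[pos])) ++pos;
            return Tok::Ident;
        }
        if (std::isdigit((unsigned char)in[pos])) {        // state: IN_INT
            while (pos < in.size() && std::isdigit((unsigned char)in[pos])) ++pos;
            return Tok::Int;
        }
        ++pos;
        return Tok::Error;
    }

    int main() {
        std::string_view src = "count 42";
        std::size_t pos = 0;
        for (Tok t; (t = nextToken(src, pos)) != Tok::End; )
            std::cout << static_cast<int>(t) << "\n";      // 0 (Ident), 1 (Int)
    }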
The so-called scannerless GLR parsers suffer from this performance problem, relative to GLR parsers that are designed to use tokens.
My company builds a tool, the DMS Software Reengineering Toolkit, which uses a GLR parser (and successfully parses all the common languages you know and a lot of odder ones you don't). We knew about scannerless parsers and chose not to implement them because of the speed differential; instead we have a classically styled (but extremely powerful) LEX-like subsystem for defining lexical tokens. In the one case where DMS went nose-to-nose against a tool based on XT (a toolset with a scannerless GLR parser) processing the same input, DMS appeared to be about 10x as fast as the XT package. To be fair, the experiment was ad hoc and never repeated, but since it matched my suspicions I saw no reason to repeat it. YMMV. And of course, if we want to go scannerless, it's pretty easy to write a grammar with character terminals, as I have already pointed out.
Scannerless GLR parsers do have another very nice property that won't matter to most people. You can take two separate grammars for a scannerless parser, literally concatenate them, and still get a parser (often with a lot of ambiguities). This matters a lot when you are building one language embedded inside another. If that's not what you are doing, this is just an academic curiosity.
And, AFAIK, Elkhound isn't scannerless. (I could be wrong on this). (EDIT: 2/10: Looks like I was wrong. Won't be the first time in my life :)
Upvotes: 11
Reputation: 96939
boost::spirit::qi doesn't need a lexer, although you can use boost::spirit::lex as a front-end. From the introduction of boost::spirit::lex:
In fact, Spirit.Qi allows you to write parsers without using a lexer, parsing the input character stream directly, and for the most part this is the way Spirit has been used since its invention.
boost::spirit (the original one) actually worked exactly the way you want, but it has been moved to boost::spirit::classic. spirit::qi and spirit::lex are the new design of Spirit, so I left the classic version out :)
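For a concrete taste of the quote above, here is a minimal sketch (illustrative only, assuming a recent Boost with Spirit.Qi available): the grammar consumes the raw character stream directly, with a skipper standing in for the whitespace handling a lexer would otherwise do:

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;

    int main() {
        std::string input = "42, 7, 19";
        std::vector<int> values;
        auto first = input.begin();
        auto last = input.end();

        // No lexer anywhere: int_ % ',' matches comma-separated integers
        // straight off the character iterators.
        bool ok = qi::phrase_parse(first, last, qi::int_ % ',',
                                   ascii::space, values);

        if (ok && first == last)
            for (int v : values) std::cout << v << "\n";
    }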
Upvotes: 3