user7673439

Simple Ragel grammar with optional whitespace

Ragel is a powerful tool, but I have trouble with optional elements in a grammar. I have simple lines of numbers or strings separated by ','. The trouble is with whitespace: I don't know how to correctly allow optional whitespace between ',' and a variable. A newline (enter) may appear anywhere between tokens. A line ends with ';' or a newline. I need to use the $err() function for error handling.

This is my test set. Good input:

this , is , a   , test ; and, this,
is,ok

next, trouble
How,produce,good
grammar;
ok

output:

and input that should fail ('this not' has no ',' between the variables; ',,' has no number or variable between the commas):

this not , working
and,
this,, too

When I use this grammar, I get separate characters or an error at the end of the line:

 whitespace = [ \t\v\f] ;
 enter      = [\r\n] ;
 string     = (alnum | '_')+ ;
 number     = ('+'|'-')?[0-9]+'.'[0-9]+( [eE] ('+'|'-')? [0-9]+ )? ;
 var        = string | number ;
 koniec     = (';' | enter)  ;
 line       = var whitespace* ( ',' whitespace* var )* whitespace* koniec ;
 main := whitespace* ( line )* ;

This is my whole code: https://github.com/and09/simple_grammar

Upvotes: 2

Views: 369

Answers (1)

Roman Khimov

Reputation: 4977

It's a bit hard to give definitive answers when you don't have a full specification of your grammar, but let's at least try to make your example work the way you want it to and then you should be able to correct it if needed.

So, your full example from GitHub, which has some printing actions in it, actually tells a lot about what's going on in the state machine (the other thing you should check periodically while working with Ragel is the state machine graph that it can produce for you). In its initial specification (same as in the question) it outputs the following when run:

[this]< >,< >[is]

So it has a problem going into the third variable. Why is that? Well, that's because your line only specifies one ( ',' whitespace* var) element, but if you try to fix that by specifying ( ',' whitespace* var)*, it won't work either, because now you're demanding that your var be immediately followed by a comma on repetition, without any whitespace. Let's try this (actions intentionally removed), moving whitespace into the repeating group:

line = var whitespace* ( ',' whitespace* var whitespace*)* koniec;

Now you get this in the output:

[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >

Which is an obvious improvement. So why does it fail now? Well, that's because after your koniec the machine wants to wrap around to the next line, but in order to do that it needs to see a var. Instead, we have whitespace after ; in the input. So we need to change our definition of line to allow some whitespace at the beginning, but that also makes whitespace redundant in main, so let's try these definitions:

line = whitespace* var whitespace* ( ',' whitespace* var whitespace*)* koniec;
main:= line*;

Now we have this output:

[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >
< >[and],< >[this]

Which again is better, but still not good enough. Now you can see that it chokes on the newline, which is actually a bit unclear to me too. You say that

The end line is ';' or enter

Yet you want to get

line(and,this,is,ok)

So let's assume that enter starts a new line unless you have a comma at the end of the line. To specify that in the grammar, let's do this:

line = whitespace* var whitespace* ( ',' (whitespace | enter)* var whitespace*)* koniec;

Now you get this in the output:

[this]< >,< >[is]< >,< >[a]< >< >< >,< >[test]< >
< >[and],< >[this],[is],[ok]

Why is it not going further? That's because our line has to have a var, but we have an empty line in the input instead. That also raises the question of whitespace-only lines, so let's make our line work with empty or whitespace-only content like this:

line       = whitespace* (var whitespace* ( ',' (whitespace | enter)* var whitespace*)*)? koniec;

And bang! Suddenly you have all the word groups you want in the output. But you also have some excess lines. Those are actually very easy to fix: you just need to move your pisz_enter action from koniec into the line like this:

vargroup   = var whitespace* ( ',' %pisz_przecinek (whitespace | enter)* var whitespace*)* %pisz_enter;
line       = whitespace* vargroup? koniec;
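
If you want to sanity-check these final definitions without rebuilding the Ragel machine each time, here is a rough Python sketch that transliterates them into regular expressions (actions omitted). The helper names accepts and groups are mine and are not part of the repository, and Python's re only approximates what the generated state machine does, so treat it as a quick cross-check rather than a replacement.

```python
import re

# Transliteration of the Ragel definitions above (actions omitted).
WS = r"[ \t\v\f]"                                       # whitespace
ENTER = r"[\r\n]"                                       # enter
STRING = r"[A-Za-z0-9_]+"                               # (alnum | '_')+
NUMBER = r"[+-]?[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?"     # number
VAR = rf"(?:{STRING}|{NUMBER})"                         # var
KONIEC = rf"(?:;|{ENTER})"                              # koniec
VARGROUP = rf"{VAR}{WS}*(?:,(?:{WS}|{ENTER})*{VAR}{WS}*)*"
LINE = rf"{WS}*(?:{VARGROUP})?{KONIEC}"
MAIN = re.compile(rf"(?:{LINE})*")                      # main := line*


def accepts(text):
    """True if the whole input is a sequence of complete lines."""
    return MAIN.fullmatch(text) is not None


def groups(text):
    """Return the variables of each non-empty line, or None if rejected."""
    if not accepts(text):
        return None
    result = []
    for match in re.finditer(LINE, text):
        # NUMBER goes first so "1.5" is not split into "1" and "5".
        tokens = re.findall(rf"{NUMBER}|{STRING}", match.group(0))
        if tokens:
            result.append(tokens)
    return result


if __name__ == "__main__":
    print(groups("and, this,\nis,ok\n"))   # [['and', 'this', 'is', 'ok']]
    print(groups("this,, too\n"))          # None
```

Note that the last line of the input has to end with ';' or a newline to be accepted, same as with the Ragel definition of koniec.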

That's it. Two other things I can notice are:

  • you want your number to be something like

    number     = (('+'|'-')?[0-9]+'.'[0-9]+( [eE] ('+'|'-')? [0-9]+ )?) >Poczatek_Napisu %pisz_stala ;

    to be printed properly

  • you actually need to redo token extraction for it to work properly. You're reading from the file in fixed-size chunks and your actions currently store a token start pointer (poczatek_napisu). If a token is split between chunks (which is very likely for any file longer than sizeof bufor), you're going to have a problem. It's not an FSM problem, the machine will work just fine, it's just what you do in actions. But that's beyond the scope of the current question.
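
The chunk-boundary issue from the last bullet is independent of Ragel, so here is a rough Python sketch of one usual remedy: hold back the token that touches the end of the current chunk and prepend it to the next one. In the C host code the analogous move would be to copy the unfinished token to the start of bufor before reading more data. The function name tokens_from_chunks and the simplified token rule (only (alnum | '_')+, no numbers) are my assumptions, not code from the repository.

```python
import io
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+")   # simplified: string tokens only


def tokens_from_chunks(stream, chunk_size):
    """Yield whole tokens even when the input arrives in small chunks."""
    carry = ""                          # unfinished token from the last chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = carry + chunk
        carry = ""
        for match in TOKEN.finditer(data):
            if match.end() == len(data):
                # Token touches the end of the data: it may continue
                # in the next chunk, so keep it instead of emitting it.
                carry = match.group(0)
            else:
                yield match.group(0)
    if carry:
        yield carry                     # last token at end of input


if __name__ == "__main__":
    source = io.StringIO("this , is , a test")
    print(list(tokens_from_chunks(source, 3)))   # ['this', 'is', 'a', 'test']
```

With a chunk size of 3 the words "this" and "test" are both split across reads, yet each still comes out as one token.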

Upvotes: 1
