SoniEx2
SoniEx2

Reputation: 2044

How do I convert a PCRE to EBNF?

I'm writing a spec thingamajig and I don't know EBNF. I have the following PCRE:

^(?:\$(?:\$|{\d+})|[^$])*$

Where, in the input:

And I need to convert it to EBNF. How do I convert this PCRE to EBNF?

(I noticed there are many questions about going from EBNF to PCRE, but I haven't seen any about going the other way around)

Upvotes: 0

Views: 422

Answers (1)

rici
rici

Reputation: 241791

Two things make answering this apparently simple question complicated:

  1. The term "EBNF" has a large variety of manifestations. There is ISO standard ISO/IEC 14977:1996 for "Extended BNF" but as far as I know, it is rarely used in practice. (Note: There is a free download link on that page; purchase is not necessary.) Many internet protocols use "Augmented BNF" as defined by RFC 5234, which is probably a better fit for your particular problem. And there are lots of parser generators which extend BNF in various ways, generally by adding regular-expression-like repetition and optionality operators, without being standardized in any way. (In fact, it was this chaos of possible definitions that motivated the ISO to produce a standard, but as is often the case with ISO standards, the lack of free text access -- until a decade after its release -- and freely-available tools hindered adoption.)

  2. Regular expressions do not necessarily produce unambiguous grammars, and the regular expression you provide is ambiguous, since $ is allowed to be used as an ordinary character. The implication (and, I'm certain, the intention) is that a $ may not be treated as a regular character if it is followed by another $ or a number surrounded by braces, but the regular expression itself does not (and does not need to) make that distinction. Less obvious is what the intention might be for a string like:

     ${42 looks like an error to me but it would be accepted by the regex.
    

Anyway, here is an ISO EBNF for something similar to your language. Note that it does not accept the above string.

(* EBNF does not have wildcard characters and there is no way to 
   enumerate all possible characters, so I use the exception mechanism
   to describe the set
 *)
any character
  = ? Any character representable by the source character encoding ? ;
decimal digit
  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
literal sequence
  = {any character} - 
      ({any character}, ('$$' | '${'), {any character}) ;
escaped dollar
  = '$$' ;
parameter
  = '${', decimal digit, {decimal digit}, '}';
thingamajig
  = {literal sequence | escaped dollar | parameter}

On the whole, since you provide a mechanism for escaping dollar signs, it would probably be simpler to just ban the use of loose dollar signs. That makes both the specification and the parser simpler, and avoids the problem of non-canonical representations. (Non-canonical representations can be a security problem because round-tripping a string into an internal representation and back could then fail fingerprint validation, and because they allow for information leaking. Those may not be significant in this case, but in general best practice for data exchange protocols is to avoid non-canonical representations when possible.)

Upvotes: 1

Related Questions