Reputation: 736
My language has single-quoted Unicode character literals like:
'h'
'🙂'
etc.
I'm using the following rule to parse this:
CHAR = "'" (!"'" c:.) "'" { return c; }
This works for ASCII characters, but unfortunately not for Unicode.
How can I modify this to match a single Unicode character like the emoji above?
Upvotes: 1
Views: 117
Reputation: 10414
The input to Peggy is a JS string, and JS strings are encoded as UTF-16, so the '.' in your rule matches a single UTF-16 code unit. ASCII characters are one code unit each, but an emoji like 🙂 lies outside the Basic Multilingual Plane and is encoded as a surrogate pair of two code units, so '.' only consumes its first half.
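You can see the mismatch from any JS console:
// 🙂 is one code point (U+1F642) but two UTF-16 code units.
console.log("🙂".length);                      // 2  (code units)
console.log([..."🙂"].length);                 // 1  (code points)
console.log("🙂".charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log("🙂".charCodeAt(1).toString(16));  // "de42" (low surrogate)
Here is your grammar with the rules I've been using to match Unicode characters: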
// The inner '@' "plucks" the following value out of the enclosing parens,
// then the outer '@' plucks *that* value as the return value of the rule.
// '@' plucks are not in peg.js, but they are in Peggy.
CHAR = "'" @(!"'" @ValidSourceCharacter) "'"
// If you really want to use peg.js, you could do this instead:
CHAR2 = "'" outer:(!"'" inner:ValidSourceCharacter { return inner }) "'" { return outer }
ValidSourceCharacter
= SourceCharacterLow
/ SurrogatePair
// Not surrogates
SourceCharacterLow
= [\u0000-\uD7FF\uE000-\uFFFF]
SurrogatePair
= $( [\uD800-\uDBFF][\uDC00-\uDFFF] )
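If you want to try it out, here is a minimal harness, assuming Peggy is installed as the npm package "peggy" (the grammar above is inlined, with backslashes doubled so the escapes reach Peggy instead of being consumed by JS):
const peggy = require("peggy");

const parser = peggy.generate(`
  CHAR = "'" @(!"'" @ValidSourceCharacter) "'"
  ValidSourceCharacter
    = SourceCharacterLow
    / SurrogatePair
  SourceCharacterLow
    = [\\u0000-\\uD7FF\\uE000-\\uFFFF]
  SurrogatePair
    = $( [\\uD800-\\uDBFF][\\uDC00-\\uDFFF] )
`);

console.log(parser.parse("'h'"));  // "h"
console.log(parser.parse("'🙂'")); // "🙂" (both halves of the surrogate pair)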
Upvotes: 0
Reputation: 736
I solved this by parsing character literals as strings.
Then, in JS, I spread the string into individual Unicode code points.
If there is more than one code point, I throw a parse error.
Otherwise, I return the single code point.
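A minimal sketch of what that can look like (the rule shape here is illustrative, not my exact code); error() is the helper that Peggy and peg.js expose inside actions:
CHAR = "'" s:$((!"'" .)*) "'" {
  // The string iterator walks code points, so the spread
  // reassembles surrogate pairs into single entries.
  const codepoints = [...s];
  if (codepoints.length !== 1) {
    error("character literal must contain exactly one code point");
  }
  return codepoints[0];
}
Note that this counts code points, not grapheme clusters, so a multi-code-point emoji (a flag, or a ZWJ sequence) is still rejected as more than one character.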
Upvotes: 1
Reputation: 17062
This seems to work: the optional class matches a high surrogate, so the following '.' picks up the low surrogate (or, for a BMP character, the whole thing):
CHAR = "'" c:$([\uD800-\uDBFF]?.) "'" { return c; }
Upvotes: -1