divs1210

Reputation: 736

How to match a single Unicode character in single quotes

My language has single-quoted Unicode character literals like:

'h'
'🙂'

etc.

I'm using the following rule to parse this:

CHAR = "'" (!"'" c:.) "'" { return c; }

This works for ASCII characters, but unfortunately not for Unicode.

How can I modify this to match a single Unicode character like the emoji above?

Upvotes: 1

Views: 117

Answers (3)

Joe Hildebrand

Reputation: 10414

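The input to Peggy is a JS string, and JS strings are encoded as UTF-16. A character outside the Basic Multilingual Plane, such as 🙂, is stored as two code units (a surrogate pair), so a single . matches only half of it. Here is your grammar with the rules I've been using to match Unicode characters: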

// The inner '@' "plucks" the following value out of the enclosing parens,
// then the outer '@' plucks *that* value as the return of the expression.
// '@'-plucks are not in peg.js, but are in Peggy.
CHAR = "'" @(!"'" @ValidSourceCharacter) "'"

// If you really want to use pegjs, you could do this:
CHAR2 = "'" outer:(!"'" inner:ValidSourceCharacter { return inner }) "'" { return outer }

ValidSourceCharacter
  = SourceCharacterLow
  / SurrogatePair

// Not surrogates
SourceCharacterLow
  = [\u0000-\uD7FF\uE000-\uFFFF]

SurrogatePair
  = $( [\uD800-\uDBFF][\uDC00-\uDFFF] )
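
To see why the SurrogatePair rule is needed, here is a small plain-JavaScript sketch (not part of the grammar). The helper names isHighSurrogate and isLowSurrogate are made up for illustration; their ranges simply mirror the two character classes in the SurrogatePair rule.

```javascript
// Illustrative only: these helper names are hypothetical and mirror the
// [\uD800-\uDBFF] and [\uDC00-\uDFFF] classes used in the grammar above.
function isHighSurrogate(unit) {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit) {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

const smiley = "🙂"; // U+1F642, outside the Basic Multilingual Plane

// .length counts UTF-16 code units, so the emoji occupies two positions.
const unitCount = smiley.length;

// Spreading a string iterates by code point, so the emoji is one element.
const codePointCount = [...smiley].length;

// The two code units form a high/low surrogate pair.
const first = smiley.charCodeAt(0);
const second = smiley.charCodeAt(1);
const isPair = isHighSurrogate(first) && isLowSurrogate(second);

console.log(unitCount, codePointCount, isPair);
```

An ASCII character such as "h" is a single code unit that falls in neither surrogate range, which is why it is matched by SourceCharacterLow instead.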

Upvotes: 0

divs1210

Reputation: 736

I solved this by parsing character literals as strings.

Then, in JS, I spread the string into individual Unicode code points.

If there is more than one code point, I throw a parse error.

Otherwise, I pick the first code point.
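
A minimal sketch of that check, assuming the literal's contents have already been parsed into a JS string. The function name singleCodePoint and the error message are hypothetical, and rejecting an empty literal is an extra assumption not stated above.

```javascript
// Hypothetical helper: the name and the error message are illustrative.
function singleCodePoint(text) {
  // Spreading iterates by Unicode code point, so a surrogate pair such as
  // the two code units of "🙂" stays together as one element.
  const codePoints = [...text];

  // Assumption: an empty literal is rejected as well as a too-long one.
  if (codePoints.length !== 1) {
    throw new SyntaxError(
      `character literal must contain exactly one code point, got ${codePoints.length}`
    );
  }

  return codePoints[0];
}
```

Inside a Peggy action the same check could report the failure through the parser's own error mechanism rather than throwing a SyntaxError directly. Note that this counts code points, not user-perceived characters, so an emoji built from several code points joined together would still be rejected.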

Upvotes: 1

Yukulélé

Reputation: 17062

That seems to work:

CHAR = "'" c:$([\uD800-\uDBFF]? [^']) "'" { return c; }

Upvotes: -1
