Java, JavaCC: How to parse characters outside the BMP?

Question

Look at the definition of NameStartChar:

If I interpret this correctly, the last range (#x10000-#xEFFFF) goes beyond what Java's 16-bit char type (a single UTF-16 code unit) can hold. So it must be UTF-32, right? So I need to check pairs of chars against this range instead of single chars, right?

My questions are:

How do I check for such character ranges using standard Java methods?
How is it possible to define such ranges in JavaCC?
- JavaCC complains about \u10000 and \uEFFFF

Thank you!

NOTE: Don't worry, I am not trying to write my own XML parser.
EDIT: I am writing a parser that checks whether text input from miscellaneous (non-XML) text formats matches valid XML names.

Jon Skeet · Accepted Answer

Have a look at Character.toCodePoint(char, char), which converts a surrogate pair into a full-range code point. String.codePointAt may well be useful to you, too.
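As a rough sketch of how those two methods could be applied to the range from the question, something like the following might work. The class name SupplementaryNameStart and its helper methods are made up for illustration; only the Character and String calls are standard Java.

```java
// Hypothetical helper class; the names below are illustrative only.
final class SupplementaryNameStart {
    // Last NameStartChar range quoted in the question: #x10000-#xEFFFF.
    static final int RANGE_START = 0x10000;
    static final int RANGE_END = 0xEFFFF;

    // True if the code point lies inside the range above.
    static boolean inLastRange(int codePoint) {
        return codePoint >= RANGE_START && codePoint <= RANGE_END;
    }

    // Combines two chars into one code point, but only if they
    // really form a high/low surrogate pair.
    static boolean pairInLastRange(char high, char low) {
        return Character.isSurrogatePair(high, low)
                && inLastRange(Character.toCodePoint(high, low));
    }

    // Lets String.codePointAt do the surrogate decoding for us.
    static boolean startsInLastRange(String s) {
        return !s.isEmpty() && inLastRange(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String s = "\uD800\uDC00abc"; // U+10000 followed by "abc"
        System.out.println(pairInLastRange(s.charAt(0), s.charAt(1)));
        System.out.println(startsInLastRange(s));
        System.out.println(startsInLastRange("abc"));
    }
}
```

Working on int code points rather than char values keeps the range check a plain numeric comparison, so the other NameStartChar ranges could be added to the same method.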

There's a lot of other surrogate support within Character and String. To know exactly which methods to call, we'd need to know the exact details of your situation.
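To give an idea of that other surrogate support, here is a minimal sketch that walks a string code point by code point and prints the surrogate pair behind a given code point. The class CodePointScan and its method names are hypothetical; Character.highSurrogate and Character.lowSurrogate require Java 7 or later.

```java
// Hypothetical helper class; the names below are illustrative only.
final class CodePointScan {
    // Counts code points outside the BMP, stepping by one or two
    // chars depending on how many chars each code point occupies.
    static int countSupplementary(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    // Formats the high and low surrogate of a supplementary code
    // point as two four-digit hex values.
    static String surrogates(int codePoint) {
        return String.format("%04X %04X",
                (int) Character.highSurrogate(codePoint),
                (int) Character.lowSurrogate(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(countSupplementary("a\uD800\uDC00b"));
        System.out.println(surrogates(0x10000));
        System.out.println(surrogates(0xEFFFF));
    }
}
```

The two surrogates calls give the char-level bounds of the #x10000-#xEFFFF range: a high surrogate between D800 and DB7F followed by a low surrogate between DC00 and DFFF. A grammar that only sees 16-bit chars could in principle express the range as such a pair of char ranges, though whether a particular JavaCC version accepts that is an assumption to verify.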
