RandomB

Reputation: 3747

UTF-8, Unicode in SML/NJ

I am playing with SML/NJ (version 110.99.4) on Windows 10.

I have a structure in a source file saved in UTF-8 encoding:

...
    let
      val s:string = "søk"
    in
      print s
    end;
...

My console uses code page 65001 (UTF-8), as reported by chcp. This code prints søk correctly. I have three questions:

  1. As far as I know, SML/NJ has widestring (and widechar) types for Unicode, but they are optional (and are actually missing on Windows). I assumed that string is an ASCII string, but it seems it is not. So what is the string type? Code points? UTF-8?
  2. How portable is this string from SML/NJ? Can I use it anywhere I want UTF-8 (on Linux, for example)?
  3. Does string behave the same way in all SML implementations?

PS. My SML/NJ version also has a UTF8 structure (open UTF8 succeeds), and it refers to wchar. But I see that string lets me print non-ASCII strings correctly, while the String structure refers to char. This confuses me even more: what does string contain, wchar or char (holding UTF-8)? And then what is the missing widechar?

PPS. An attempt to enter a non-ASCII string in an sml.bat REPL session failed with:

stdIn:2.10 Error: illegal non-printing character in string
stdIn:2.11 Error: illegal non-printing character in string
stdIn:2.12 Error: illegal non-printing character in string
...

Sorry for so many questions. I would appreciate any clarification about the state of Unicode and UTF-8 in the world of Standard ML (and SML/NJ), and about convenient ways to work with them.

Upvotes: 1

Views: 122

Answers (1)

RandomB

Reputation: 3747

I found, for instance, this library: https://github.com/cannam/sml-utf8, which defines WdString. It provides encoding and decoding between UTF-8 and wide strings, along with other "standard" (for SML) string operations. I tried it with SML/NJ and it seems to work.
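
As a rough illustration of the first question, here is a minimal sketch that uses only the Basis Library and treats string as a sequence of 8-bit char values. The literal spells out the UTF-8 bytes of "søk" with decimal escapes, so it does not depend on how the source file or the console is encoded. The helpers isContinuation and codePointCount are hypothetical names made up for this example (they are not part of SML/NJ or of sml-utf8), and the counting assumes the input is already valid UTF-8.

```sml
(* "s\195\184k" is "søk" written as its UTF-8 bytes:
   ø is U+00F8, which UTF-8 encodes as the two bytes 195 and 184. *)
val s = "s\195\184k"

(* size counts 8-bit chars (bytes), not Unicode characters. *)
val byteCount = String.size s

(* The individual byte values stored in the string. *)
val bytes = List.map Char.ord (String.explode s)

(* Hypothetical helper: a UTF-8 continuation byte has the form 10xxxxxx,
   i.e. a value in the range 128..191. *)
fun isContinuation c =
  let val n = Char.ord c in n >= 128 andalso n <= 191 end

(* Hypothetical helper: count code points by counting every byte that is
   not a continuation byte. Assumes the string is valid UTF-8. *)
fun codePointCount str =
  List.length (List.filter (not o isContinuation) (String.explode str))

val cpCount = codePointCount s
```

Under these assumptions byteCount is 4 and cpCount is 3, which would suggest that print simply passes the bytes through and the UTF-8 console does the decoding.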

Upvotes: 1
