Vee

Reputation: 776

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:

Thanks!

Upvotes: 2

Views: 3046

Answers (2)

Arash Jafari

Reputation: 11

In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.
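A minimal sketch of that approach, assuming a little-endian machine and a Tcl 8.6 interpreter where the unicode encoding name is available. The helper name readUnicodeFile is made up for illustration:

```tcl
# Hypothetical helper: read a whole 16-bit, native-endian text file.
proc readUnicodeFile {path} {
    set ch [open $path r]
    # "unicode" is Tcl's name for its 16-bit, native-endian encoding.
    fconfigure $ch -encoding unicode
    set text [read $ch]
    close $ch
    # A leading byte-order mark may be decoded as U+FEFF; drop it if present.
    if {[string index $text 0] eq "\uFEFF"} {
        set text [string range $text 1 end]
    }
    return $text
}
```

Usage would be something like set log [readUnicodeFile build.log], after which $log is an ordinary Tcl string that can be split into lines and parsed.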

Upvotes: 1

kostix

Reputation: 55483

I'm afraid there is currently no way to do this just by using fconfigure -encoding ?something?: the meaning of the unicode encoding is rather vague, and there's a feature request to add explicit support for the UTF-16 variants.

What could you do about it?

Since unicode in Tcl running on Windows should mean UTF-16 with native endianness [1] (little-endian on Wintel), if you only need a quick and dirty solution, just try -encoding unicode and see if that helps.

If you're aiming for a more bullet-proof, future-proof or cross-platform solution, I'd switch the channel to binary mode, read the contents two bytes at a time, and then use

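binary scan $twoBytes su n

to scan the two bytes in $twoBytes as an unsigned little-endian 16-bit integer into a variable named "n" (the u modifier needs Tcl 8.5 or later; a plain s yields negative numbers for code units at or above 0x8000), followed by something like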

set c [format %c $n]

to produce a unicode character out of the number in $n, and assign it to a variable.
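Put together, the two commands might look like the sketch below. The proc name decodeUnit is hypothetical, and the su format (unsigned, Tcl 8.5 or later) is used so that code units at or above 0x8000 do not come out negative:

```tcl
# Hypothetical helper: decode one little-endian 16-bit code unit.
proc decodeUnit {twoBytes} {
    # "su" = 16-bit, little-endian, unsigned.
    # binary scan returns the number of conversions it performed.
    if {[binary scan $twoBytes su n] != 1} {
        error "expected two bytes, got [string length $twoBytes]"
    }
    # Turn the number into the character with that code.
    return [format %c $n]
}
```

For example, decodeUnit [binary format H* 4100] gives the letter A, since the bytes 0x41 0x00 are the little-endian form of U+0041.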

This approach takes a bit more care to get right:

  • You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
  • If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that would handle the CR+LF sequences correctly.
  • When calling read $channelId 2 to get the next character, check not only for a result of 0 or 2 bytes but also for 1 byte, which happens if the file is truncated or corrupted, and handle that case.
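The three points above can be sketched as one small reader. This is only an illustration under the assumption that the input is UCS-2LE; the proc name readUcs2leLines is made up, and a real parser would probably process lines as they arrive instead of collecting them in a list:

```tcl
# Hypothetical helper: read UCS-2LE text from a file, return a list of lines.
proc readUcs2leLines {path} {
    set ch [open $path r]
    fconfigure $ch -translation binary   ;# raw bytes, no EOL or encoding handling
    set lines {}
    set line ""
    set first 1
    set sawCR 0
    while {1} {
        set pair [read $ch 2]
        set len [string length $pair]
        if {$len == 0} break
        if {$len == 1} {
            # Odd number of bytes: the file is truncated or corrupted.
            close $ch
            error "truncated input: odd number of bytes in $path"
        }
        binary scan $pair su n
        if {$first} {
            set first 0
            if {$n == 0xFEFF} continue   ;# skip the byte-order mark
        }
        if {$n == 0x0A} {
            # LF ends the line; a CR directly before it is dropped.
            lappend lines $line
            set line ""
            set sawCR 0
            continue
        }
        if {$sawCR} {
            # The previous CR was not followed by LF, so keep it as data.
            append line "\r"
            set sawCR 0
        }
        if {$n == 0x0D} {
            set sawCR 1
            continue
        }
        append line [format %c $n]
    }
    close $ch
    if {$sawCR} { append line "\r" }
    if {$line ne ""} { lappend lines $line }
    return $lines
}
```

The state machine here is a single flag (sawCR): a CR is held back until the next code unit shows whether it was part of a CR+LF pair.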

The UCS-2 encoding differs from UTF-16 in that the latter might contain so-called surrogate pairs, and hence it is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also means detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume it's encoded in UCS-2LE.
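If surrogate pairs ever did need handling, detection comes down to two range checks on the 16-bit code units. The sketch below uses made-up proc names and returns the combined code point as a number, because a Tcl 8.x build may not be able to represent characters outside the BMP as a single character:

```tcl
# Classify a 16-bit code unit as "high", "low" or "bmp".
proc surrogateKind {n} {
    if {$n >= 0xD800 && $n <= 0xDBFF} { return high }
    if {$n >= 0xDC00 && $n <= 0xDFFF} { return low }
    return bmp
}

# Combine a high/low surrogate pair into one code point (as a number).
proc combineSurrogates {hi lo} {
    if {[surrogateKind $hi] ne "high" || [surrogateKind $lo] ne "low"} {
        error "not a valid surrogate pair"
    }
    return [expr {0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00)}]
}
```

A reader that only wants to confirm the UCS-2 assumption could call surrogateKind on every code unit and raise an error as soon as it returns anything other than bmp.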


[1] The true story is that the only thing Tcl guarantees about the textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or by reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it). Technically, though, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what the name "unicode" refers to. The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you read or write from/to some specific encoding: by configuring the stream, by using encoding convertfrom and encoding convertto, or by using binary format and binary scan. The interpreter will do the right thing no matter which precise encoding it's currently using for your source string value; it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters (like Jacl etc.) do is also up to them. In other words, this feature is internal and is not part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either; it's just an implementation detail.

Upvotes: 7
