jrm

Reputation: 609

Which encoding? Character strings enclosed by tilde ~ and curly braces {}

The APNIC Whois database contains a lot of entries for Chinese entities with some sort of encoding enclosed by ~{ and ~}. For example:

$ whois 211.68.92.0 | grep ^descr:
descr:          ~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})
descr:          Bell Labs Research China
descr:          Beijing 100080, China

Does anyone know what this is? Is it an encoding of some sort? My first guess was Punycode, but I quickly realised that Punycode output wouldn't include some of the special characters that appear here.

I have also found this encoding on some web pages.

Would be interesting to decode this, out of curiosity.

EDIT: Found in RFC 1842.

For an arbitrary mixed text with both Chinese coded text strings and ASCII text strings, we designate to two distinguishable text modes, ASCII mode and HZ mode, as the only two states allowed in the text. At any given time, the text is in either one of these two modes or in the transition from one to the other. In the HZ mode, only printable ASCII characters (0x21-0x7E) are meanful with the size of basic text unit being two bytes long.

In the ASCII mode, the size of basic text unit is one (1) byte with the exception '~~', which is the special sequence representing the ASCII character '~'. In both ASCII mode and HZ mode, '~' leads an escape sequence. However, as HZ mode has basic size of text unit being 2 bytes long, only the '~' character which appears at the first byte of the the two-byte character frame are considered as the start of an escape sequence.

The default mode is ASCII mode. Each line of text starts with the default ASCII mode. Therefore, all Chinese character strings are to be enclosed with '~{' and '~}' pair in the same text line.

The escape sequences defined are as the following:

~{       ---- escape from ASCII mode to GB2312 HZ mode
~}       ---- escape from HZ mode to ASCII mode
~~       ---- ASCII character '~' in ASCII mode
~\n      ---- line continuation in ASCII mode
~[!-z|]  ---- reserved for future HZ mode character sets
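The mode switching described above can be sketched as a small state machine. This is a minimal illustration rather than a reference implementation: the name `decode_hz` is hypothetical, the input is assumed to be well-formed 7-bit data, and each HZ-mode byte pair is mapped to GB2312 by setting the high bit of both bytes and handing the pair to Python's `gb2312` codec.

```python
def decode_hz(data: bytes) -> str:
    """Hypothetical hand-rolled HZ decoder, assuming well-formed 7-bit input."""
    out = []
    hz_mode = False  # the default mode is ASCII
    i = 0
    while i < len(data):
        if data[i] == 0x7E:  # '~' starts an escape sequence
            nxt = data[i + 1:i + 2]
            if not hz_mode and nxt == b"~":
                out.append("~")        # '~~' is a literal tilde
            elif not hz_mode and nxt == b"{":
                hz_mode = True         # '~{' switches to HZ mode
            elif hz_mode and nxt == b"}":
                hz_mode = False        # '~}' switches back to ASCII
            elif not hz_mode and nxt == b"\n":
                pass                   # '~\n' is a line continuation
            else:
                raise ValueError(f"unsupported escape at offset {i}")
            i += 2
        elif hz_mode:
            pair = data[i:i + 2]
            if len(pair) < 2:
                raise ValueError("truncated two-byte character")
            # Set the high bit of both bytes to get the GB2312 (EUC-CN) form.
            out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
            i += 2
        else:
            out.append(chr(data[i]))   # ASCII mode: one byte per character
            i += 1
    return "".join(out)


print(decode_hz(b"(~{VP9z~})"))
```

Because HZ mode always advances two bytes at a time, a `~` is only treated as an escape when it is the first byte of a two-byte frame, which matches the rule quoted above.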

A few examples of the 7 bit representation of Chinese GB coded test taken directly from [Lee89] are listed as the following:

Example 1: (Suppose there is no line size limit.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.

Example 2: (Suppose the maximum line size is 42.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.

Example 3: (Suppose a new line is started for every mode switch.)
This sentence is in ASCII.
The next sentence is in GB.~
~{<:Ky2;S{#,NpJ)l6HK!#~}~
Bye.
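To check how the line continuation behaves, the three examples can be fed to Python's built-in `hz` codec. This is a sketch that assumes each `~` at the end of a line in Examples 2 and 3 is followed by a newline (the `~\n` escape), which is written out explicitly below; the variable names are only for illustration.

```python
# The three RFC 1842 examples, with line breaks written out explicitly.
example_1 = (
    b"This sentence is in ASCII.\n"
    b"The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.\n"
)
example_2 = (
    b"This sentence is in ASCII.\n"
    b"The next sentence is in GB.~{<:Ky2;S{#,~}~\n"
    b"~{NpJ)l6HK!#~}Bye.\n"
)
example_3 = (
    b"This sentence is in ASCII.\n"
    b"The next sentence is in GB.~\n"
    b"~{<:Ky2;S{#,NpJ)l6HK!#~}~\n"
    b"Bye.\n"
)

decoded = [ex.decode("hz") for ex in (example_1, example_2, example_3)]
for text in decoded:
    print(text)

# The continuations and mode switches vanish, so all three decode identically.
print(decoded[0] == decoded[1] == decoded[2])
```

The different line-wrapping strategies only affect the 7-bit representation; the decoded text is the same in each case.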

How would I go about decoding this in python3?

Upvotes: 3

Views: 633

Answers (1)

snakecharmerb

Reputation: 55589

As the OP discovered, this is the HZ encoding for mixed ASCII and Chinese (GB2312) text, defined in RFC 1842.

The codecs module in the standard library provides this encoding as 'hz', aliased as 'hzgb', 'hz-gb', and 'hz-gb-2312'.

>>> s = "~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})"
>>> bs = s.encode('ascii')
>>> bs.decode('hz')
'贝尔实验室基础科学研究院(中国)'
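For whole lines of whois output, the same codec can be wrapped in a small helper. This is only a sketch: `decode_hz_text` is a hypothetical name, and falling back to the original text when a line is not valid HZ is an assumption about what is convenient, not something the codec does for you.

```python
def decode_hz_text(text: str) -> str:
    """Decode HZ escapes in text; return text unchanged if it is not valid HZ."""
    try:
        # HZ is a 7-bit encoding, so the text must be pure ASCII first.
        return text.encode("ascii").decode("hz")
    except UnicodeError:
        # Covers both non-ASCII input and malformed HZ sequences.
        return text


lines = [
    "descr:          ~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})",
    "descr:          Bell Labs Research China",
    "descr:          Beijing 100080, China",
]
for line in lines:
    print(decode_hz_text(line))
```

Plain ASCII lines pass through unchanged, apart from a literal `~~`, which the codec turns into a single `~`.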

Upvotes: 4
