Reputation: 9150
I am not sure I've got my nomenclature right, so please correct me :)
I've received a text file representing a Pāli dictionary: a list of words separated by newline \n (0x0a) characters. Supposedly, some of the special letters are encoded using UTF-8, but I doubt that.
Loading this text file into any of my editors (vim, Notepad, TextEdit, ..) shows quite scrambled text, for example
mhiti
A closer look at the actual bytes (using hexdump -C) then reveals the following:
0a 0a 1e 6d 68 69 74 69 0a 0a ...mhiti..
which seems to me to be the Unicode code point U+1E6D ("ṭ", or LATIN SMALL LETTER T WITH DOT BELOW). That particular letter has the UTF-8 encoding e1 b9 ad.
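To make the mismatch concrete, here is a small Python sketch that prints both encodings of U+1E6D. It only uses the standard `str.encode` codecs; the variable names are mine and purely illustrative.

```python
# Compare how U+1E6D is serialized in UTF-8 and in big-endian UTF-16.
ch = "\u1e6d"  # LATIN SMALL LETTER T WITH DOT BELOW

utf8 = ch.encode("utf-8")
utf16be = ch.encode("utf-16-be")

print(" ".join(f"{b:02x}" for b in utf8))     # e1 b9 ad
print(" ".join(f"{b:02x}" for b in utf16be))  # 1e 6d
```

The bytes in the file match the second line, not the first, which is why I doubt the file is really UTF-8.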
My question: is there a tool that helps me convert this particular file into actual UTF-8? I tried iconv, but without success; I looked briefly into a Python script, but I would think there's an easier way to get this done. It seems that this is a useful link for this problem, but isn't there a tool that can get this done? Am I missing something?
EDIT: Just to make things a little bit more entertaining, there seem to be actual UTF-8 encoded characters scattered throughout as well. For example, the word "ākiñcaññāyatana" has the following sequence of bytes
01 01 6b 69 c3 b1 63 61 c3 b1 c3 b1 01 01 79 61 74 61 6e 61
ā k i ñ c a ñ ñ ā y a t a n a
where the "ā" is encoded by its Unicode code point U+0101 (as the two bytes 01 01), and the "ñ" is encoded by the UTF-8 sequence c3 b1, which corresponds to the Unicode code point U+00F1.
EDIT: Here's one that I can't quite figure out what it's supposed to be:
01 1e 37 01 01 76 61 6b 61
? ā v a k a
I can only guess, but that doesn't make sense either. The Unicode code point U+011E is a "Ğ" (UTF-8 c4 9e), but that's not a Pāli character AFAIK; it would then be followed by a "7", which doesn't make sense in a word. Alternatively, the Unicode code point U+1E37 is a "ḷ" (UTF-8 e1 b8 b7), which is a valid Pāli character, but that would leave the first byte \x01 by itself. If I had to guess I would think this is the name "Jīvaka", but that would not match the bytes. LATER: According to the author, this is "Āḷāvaka". So, assuming the heuristics of character encoding from above, again a \x00 is missing. Adding it back in:
01 00 1e 37 01 01 76 61 6b 61
Ā ḷ ā v a k a
Are there "compressions" that remove \x00 bytes from UTF-16 encoded Unicode files?
Upvotes: 2
Views: 2281
Reputation: 9150
I think in the end this was my own fault, somehow. Browsing to this file showed a very mangled and broken version of the original UTF-16 encoded file; the "Save as" menu from the browser then saved that broken file which created the initial question for this thread.
It seems that a web browser tries to display that UTF-16 encoded file, removes non-printable characters like \x00 and converts some others to UTF-8, thus completely mangling the original file.
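As a sanity check of that explanation, here is a rough Python model of the mangling. It is an assumption that happens to reproduce the byte sequences quoted in the question, not a description of what any browser really does: each UTF-16BE code unit keeps its non-zero bytes, except that characters below U+0100 are re-encoded as UTF-8. The function name `mangle` is hypothetical.

```python
def mangle(text: str) -> bytes:
    """Model the observed damage (assumption): drop 0x00 bytes from
    UTF-16BE and re-encode characters below U+0100 as UTF-8.
    Handles BMP characters only."""
    out = bytearray()
    for ch in text:
        hi, lo = ch.encode("utf-16-be")  # two bytes per BMP character
        if hi == 0:
            # ASCII stays one byte; U+0080..U+00FF becomes two UTF-8 bytes.
            out += ch.encode("utf-8")
        else:
            # Keep the code unit but lose any zero byte (e.g. U+0100 -> 01).
            out += bytes(b for b in (hi, lo) if b != 0)
    return bytes(out)


print(mangle("\u0100\u1e37\u0101vaka").hex(" "))  # 01 1e 37 01 01 76 61 6b 61
```

Under this model "Āḷāvaka" comes out as exactly the nine bytes shown in the question, with the lone 01 being what is left of U+0100.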
Using wget to fetch the file fixed the problem, and I could convert it nicely into UTF-8 and use it further.
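For completeness, the conversion of the intact file can be sketched in a few lines of Python. The function name and the file names are hypothetical. Python's "utf-16" codec relies on a byte order mark; if the file has none, pass "utf-16-be" or "utf-16-le" explicitly (the bytes quoted in the question suggest big-endian).

```python
from pathlib import Path


def utf16_to_utf8(src: str, dst: str, encoding: str = "utf-16") -> None:
    """Re-encode a UTF-16 text file as UTF-8.

    Works on raw bytes so that newlines are not translated on the way.
    """
    text = Path(src).read_bytes().decode(encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Example call with placeholder file names:
# utf16_to_utf8("dictionary-utf16.txt", "dictionary-utf8.txt")
```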
Upvotes: 1
Reputation: 31110
I'm assuming in this context that "ṭhiti" makes sense as the contents of that file.
From your description, it looks like that file encodes characters < U+0080 as a single byte and characters > U+0100 as two-byte big-endian. That's not decodable, in general; two linefeeds (U+000A, U+000A) would have the same encoding as GURMUKHI LETTER UU (U+0A0A).
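The ambiguity is easy to reproduce. The sketch below implements the scheme as I've guessed it from the question (one byte below U+0080, two big-endian bytes otherwise); `encode_scheme` is a made-up name and the function only handles BMP characters.

```python
def encode_scheme(text: str) -> bytes:
    """Guessed scheme: code points below U+0080 as a single byte,
    everything else as two big-endian bytes (BMP only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        else:
            out += cp.to_bytes(2, "big")
    return bytes(out)


two_linefeeds = encode_scheme("\n\n")
gurmukhi_uu = encode_scheme("\u0a0a")
print(two_linefeeds == gurmukhi_uu)  # True: both are the bytes 0a 0a
```

Two different texts map to the same bytes, so no decoder can recover the input without extra knowledge about what the file contains.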
There's no invocation of iconv that'll decode it for you. You'll need either to write a custom decoder using the heuristics you know (based on character ranges or on ordering in the file), or to ask for another copy in a standard encoding.
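A custom decoder along those lines might look like the sketch below. Everything in it is an assumption drawn from the byte samples in the question: the whitelist of Pāli letters is my own guess at the alphabet, the control bytes 0x01 and 0x1E are taken to be high bytes of two-byte code points, a lone 0x01 is taken to be U+0100 with its zero byte dropped, and bytes from 0xC2 upward are taken to start a UTF-8 sequence.

```python
# Hypothetical whitelist of romanized-Pāli letters at or above U+0100;
# adjust it to the dictionary's actual alphabet.
PALI = {chr(cp) for cp in (
    0x0100, 0x0101, 0x012A, 0x012B, 0x016A, 0x016B,   # Ā ā Ī ī Ū ū
    0x1E0C, 0x1E0D, 0x1E36, 0x1E37, 0x1E40, 0x1E41,   # Ḍ ḍ Ḷ ḷ Ṁ ṁ
    0x1E42, 0x1E43, 0x1E44, 0x1E45, 0x1E46, 0x1E47,   # Ṃ ṃ Ṅ ṅ Ṇ ṇ
    0x1E6C, 0x1E6D,                                   # Ṭ ṭ
)}


def decode_mangled(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b in (0x01, 0x1E):
            # Control bytes should not occur in plain text, so try them as
            # the high byte of a two-byte big-endian code point.
            if i + 1 < len(data):
                ch = chr((b << 8) | data[i + 1])
                if ch in PALI:
                    out.append(ch)
                    i += 2
                    continue
            # Assumption: a lone 0x01 is U+0100 whose 0x00 byte was dropped.
            out.append("\u0100" if b == 0x01 else "\ufffd")
            i += 1
        elif b < 0x80:
            out.append(chr(b))  # plain ASCII
            i += 1
        elif b >= 0xC2:
            # Possible UTF-8 lead byte: sequence length from its range.
            n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            try:
                out.append(data[i:i + n].decode("utf-8"))
                i += n
            except UnicodeDecodeError:
                out.append("\ufffd")
                i += 1
        else:
            out.append("\ufffd")  # stray continuation byte
            i += 1
    return "".join(out)


print(decode_mangled(bytes.fromhex("1e6d68697469")))  # ṭhiti
```

This is still a guess and not a real decoder: for instance, a lone 01 followed by "k" (6b) would be read as "ū" (U+016B) and not as "Ā" plus "k", so the output needs checking against a word list or by eye.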
Upvotes: 3