Reputation: 4013
I can't understand which encoding approach Thunderbird uses when searching on an IMAP server with the IMAP SEARCH CHARSET command.
I tried searching for the Russian word "привет", and it was mapped to "?@825B", i.e.
A001 SEARCH CHARSET ISO-8859-1 BODY "?@825B"
How does that happen? I'm sure this is correct, as I used a sniffer to capture it, and the Dovecot server correctly found the mail containing the word "привет". ISO-8859-1 has no Russian glyphs at all! So how was it converted?
For example, "привет" (written as Unicode characters) gives "??????" when converted to ISO-8859-1 on my machine, or here: http://www.motobit.com/util/charset-codepage-conversion.asp
Upvotes: 2
Views: 1615
Reputation: 38653
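Thunderbird appears to get this value by downcasting each (16-bit?) Unicode character to a byte, keeping only the low 8 bits. For instance, "п" is U+043F, and its low byte 0x3F is "?"; "р" is U+0440, and its low byte 0x40 is "@".
For example, in C# (which uses UTF-16 internally for its char and string types), this would produce the result you are seeing: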
    const string text = "привет";
    var buffer = new char[text.Length];
    // Casting a char to byte keeps only the low 8 bits of each UTF-16 code unit.
    for (int i = 0; i < text.Length; i++)
        buffer[i] = (char) ((byte) text[i]);
    var result = new string (buffer); // "?@825B"
How Thunderbird handles surrogate pairs is anyone's guess, given only what is known from the question. It might treat the surrogate pair as 2 separate characters (as my code above would), or it might combine them into a single 32-bit Unicode code point and downcast that to a byte.
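To make the two surrogate-pair possibilities concrete, here is a minimal Python sketch. It is not Thunderbird's actual code; the function names `truncate_utf16_units` and `truncate_code_points` are made up for illustration, and the test character U+1F642 is an arbitrary choice of a character outside the Basic Multilingual Plane.

```python
def truncate_utf16_units(text: str) -> str:
    """Keep the low byte of every UTF-16 code unit (surrogates treated separately)."""
    data = text.encode("utf-16-le")
    # Each code unit is 2 bytes in little-endian order, so the low byte is at even offsets.
    return data[0::2].decode("latin-1")


def truncate_code_points(text: str) -> str:
    """Keep the low byte of every full code point (surrogate pairs already combined)."""
    return bytes(ord(ch) & 0xFF for ch in text).decode("latin-1")


if __name__ == "__main__":
    # Both strategies agree for BMP-only text such as the word from the question.
    print(truncate_utf16_units("привет"))   # ?@825B
    print(truncate_code_points("привет"))   # ?@825B

    # U+1F642 is encoded in UTF-16 as the surrogate pair D83D DE42.
    smiley = "\U0001F642"
    print(truncate_utf16_units(smiley))     # =B  (low bytes 0x3D and 0x42)
    print(truncate_code_points(smiley))     # B   (low byte of 0x1F642)
```

Sending a search for a character outside the BMP and capturing the resulting SEARCH command would show which of the two behaviours, if either, Thunderbird actually has.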
Upvotes: 0