Reputation: 15503
I have a Ruby program running on Windows which calls a shell command (which is known to output UTF-16) using Open3:
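require 'open3'
require 'json'

attrs = {}
attrs[:stdout], attrs[:stderr], status = Open3.capture3(command)
unless attrs[:stderr].nil?
  begin
    # Relabel the raw bytes as UTF-16LE, then transcode them to UTF-8.
    attrs[:stderr].force_encoding(Encoding::UTF_16LE).encode!(Encoding::UTF_8)
  rescue => e
    # Transcoding failed: keep the raw bytes as a JSON array instead.
    attrs[:stderr] = attrs[:stderr].bytes.to_json.encode!(Encoding::UTF_8)
  end
end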
If the conversion from UTF_16LE doesn't work and raises an exception, I simply take the bytes, serialize them as a JSON string and encode that as UTF_8.
Well... the exception was raised, and I captured the output array of bytes in the rescue clause. It looks like this:
[10,84,104,105,115,32,97,112,112,108,105,99,97,116,105,111,110,32,104,97,115,32,114,101,113,117,101,115,116,101,100,32,116,104,101,32,82,117,110,116,105,109,101,32,116,111,32,116,101,114,109,105,110,97,116,101,32,105,116,32,105,110,32,97,110,32,117,110,117,115,117,97,108,32,119,97,121,46,10,80,108,101,97,115,101,32,99,111,110,116,97,99,116,32,116,104,101,32,97,112,112,108,105,99,97,116,105,111,110,39,115,32,115,117,112,112,111,114,116,32,116,101,97,109,32,102,111,114,32,109,111,114,101,32,105,110,102,111,114,109,97,116,105,111,110,46,10]
How can I convert it back to text in some format? For example, if I do:
irb> "dog".bytes
=> [100, 111, 103]
irb> "कुत्रा".bytes
=> [224, 164, 149, 224, 165, 129, 224, 164, 164, 224, 165, 141, 224, 164, 176, 224, 164, 190]
Is there a way to programmatically convert [100, 111, 103] to "dog" or [224, 164, 149, 224, 165, 129, 224, 164, 164, 224, 165, 141, 224, 164, 176, 224, 164, 190] back to "कुत्रा" ? And is there a way to figure out what my output array of bytes means?
------------------------- UPDATE ---------------------------
I dug around a bit, but it took a while, because "decode" is not a thing. However, I did the following with the array which I held in the variable message:
message.map{|c| c.chr}.join("")
=> "\nThis application has requested the Runtime to terminate it in an unusual way.\nPlease contact the application's support team for more information.\n"
So my problem is solved, in that the error message is not in UTF-16LE.
However, when I did this, I got the result which follows:
irb> "कुत्रा".bytes.map{|c| c.chr}.join("")
=> "\xE0\xA4\x95\xE0\xA5\x81\xE0\xA4\xA4\xE0\xA5\x8D\xE0\xA4\xB0\xE0\xA4\xBE"
How do I convert this strange looking string or byte sequence into the more meaningful "कुत्रा" ?
Upvotes: 1
Views: 2476
Reputation: 1197
To answer your first question about the bytes, take a look at the Array#pack method.
[100, 111, 103].pack('U*') # Returns 'dog'.
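The 'U*' directive treats each integer in the array as a Unicode code point and encodes it as UTF-8. That matches the byte values only for ASCII text such as 'dog'. For an array of raw bytes that may contain values above 127, use 'C*' (one unsigned byte per integer) and then set the encoding on the result.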
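A minimal sketch of how the two pack directives differ, assuming the source file or irb session is UTF-8 (Ruby's default source encoding). The variable names are only illustrative:

```ruby
bytes = "कुत्रा".bytes   # UTF-8 bytes, each in 0..255

# 'C*' writes each integer as one raw byte. The result is tagged as binary
# (ASCII-8BIT), so relabel it as UTF-8 without changing the bytes.
from_bytes = bytes.pack('C*').force_encoding(Encoding::UTF_8)

# 'U*' treats each integer as a Unicode code point and encodes it as UTF-8,
# so it round-trips with String#codepoints rather than String#bytes.
from_codepoints = "कुत्रा".codepoints.pack('U*')

# For pure ASCII, byte values and code points coincide, so 'U*' also works.
ascii = [100, 111, 103].pack('U*')

# Feeding UTF-8 bytes to 'U*' re-encodes every byte as its own character.
mangled = bytes.pack('U*')
```

Here from_bytes and from_codepoints both compare equal to the original string, while mangled does not.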
If you use pack on your error message, you get:
"\nThis application has requested the Runtime to terminate it in an unusual way.\nPlease contact the application's support team for more information.\n"
------------------------- UPDATE ---------------------------
Just noticed you figured the first part out and added a new question.
How do I convert this strange looking string or byte sequence into the more meaningful "कुत्रा" ?
When you execute "string".bytes.map{|c| c.chr}.join("")
the bytes in the new string are the same, but the original encoding is lost. This can be seen here:
s = "dog"
s.encoding #=> #<Encoding:UTF-8>
s = "dog".bytes.map{|c| c.chr}.join("") #=> "dog"
s.encoding #=> #<Encoding:US-ASCII>
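The change of encoding comes from Integer#chr itself. A small sketch of its behaviour, assuming Encoding.default_internal is left at its default of nil (variable names are only illustrative):

```ruby
low  = 100.chr   # byte below 128
high = 224.chr   # byte at or above 128

low_enc  = low.encoding    # US-ASCII
high_enc = high.encoding   # ASCII-8BIT (binary)

# Passing an encoding makes chr treat the integer as a code point, which is
# not what raw bytes need: 224 becomes the single character U+00E0.
as_codepoint = 224.chr(Encoding::UTF_8)

# Joining one-byte strings keeps the bytes and drops the original label.
joined_ascii  = [100, 111, 103].map { |c| c.chr }.join("")
joined_binary = [224, 164, 149].map { |c| c.chr }.join("")

# Relabelling the joined bytes as UTF-8 recovers the first letter of the word.
first_letter = joined_binary.dup.force_encoding(Encoding::UTF_8)
```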
The map/chr/join round trip has the expected effect with strings like 'dog', because UTF-8 is backwards compatible with US-ASCII: a string that uses only ASCII characters has exactly the same bytes in UTF-8. Characters that take more than one byte in UTF-8, like '€', have no ASCII representation, and Integer#chr returns a binary (ASCII-8BIT) string for any byte above 127, so the joined string is treated as raw bytes. So, to answer your question, what you need to do is force the appropriate encoding on the string, like this:
"कुत्रा".bytes.map{|c| c.chr}.join("").force_encoding('UTF-8') #=> "कुत्रा"
Hope it helps
Upvotes: 3
Reputation: 1325
Is there a way to programmatically convert [100, 111, 103] to "dog"?
pry(main)> "dog".bytes.pack('c*')
=> "dog"
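For the other letters, pack('c*') produces the right bytes but tags them as binary, so add force_encoding: "कुत्रा".bytes.pack('c*').force_encoding('UTF-8'). Note that pack('U*') expects code points rather than bytes, so it only gives the same result for ASCII. I can't type Marathi on my PC (it also means 'dog', lol).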
How do I convert this strange looking string or byte sequence into the more meaningful "कुत्रा" ?
pry(main)> p "कुत्रा".bytes.map{|c| c.chr}.join("")
=> "\xE0\xA4\x95\xE0\xA5\x81\xE0\xA4\xA4\xE0\xA5\x8D\xE0\xA4\xB0\xE0\xA4\xBE"
pry(main)> puts "कुत्रा".bytes.map{|c| c.chr}.join("")
कुत्रा
=> nil
Which is basically:
puts "\xE0\xA4\x95\xE0\xA5\x81\xE0\xA4\xA4\xE0\xA5\x8D\xE0\xA4\xB0\xE0\xA4\xBE"
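The difference between p and puts above comes down to the encoding label: p prints String#inspect, which escapes non-ASCII bytes of a binary string, while puts writes the raw bytes and leaves it to the terminal to interpret them. A short sketch under the assumption of a UTF-8 source file (variable names are only illustrative):

```ruby
original = "कुत्रा"                                   # UTF-8 literal
rebuilt  = original.bytes.map { |c| c.chr }.join("")  # same bytes, binary label

same_bytes     = rebuilt.bytes == original.bytes  # the bytes are identical
inspect_before = rebuilt.inspect                  # what p prints: \xE0\xA4... escapes
equal_before   = rebuilt == original              # false while the labels differ

# Relabel in place; the bytes themselves are untouched.
rebuilt.force_encoding(Encoding::UTF_8)

valid_after = rebuilt.valid_encoding?             # the bytes are well-formed UTF-8
equal_after = rebuilt == original                 # true once both are UTF-8
```

So puts only appears to work because a UTF-8 terminal decodes the bytes itself. The string is still binary as far as Ruby is concerned until force_encoding is called.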
Upvotes: 1