Reputation: 2076
I have a string in UTF-8 hex like this:
s = "0059006F007500720020006300720065006400690074002000680061007300200067006F006E0065002000620065006C006F00770020003500200064006F006C006C006100720073002E00200049006600200079006F00750020006800610076006500200061006E0020004100640064002D004F006E0020006F007200200042006F006E0075007300200079006F007500720020007200650073006F00750072006300650073002000770069006C006C00200077006F0072006B00200075006E00740069006C0020006500780068006100750073007400650064002E00200054006F00200074006F00700020007500700020006E006F007700200076006900730069007400200076006F006400610066006F006E0065002E0063006F002E006E007A002F0074006F007000750070"
I want to convert this into an actual UTF-8 string. It should read:
Your credit has gone below 5 dollars. If you have an Add-On or Bonus your resources will work until exhausted. To top up now visit vodafone.co.nz/topup
This works:
s.scan(/.{4}/).map { |a| [a.hex].pack('U') }.join
but I'm wondering whether there's a better way to do this, for example with Encoding::Converter#convert.
Upvotes: 1
Views: 3170
Reputation: 42109
If you intend to use this on other oddly encoded strings, you could strip the leading 00 byte from each 4-digit group (this only works when every character fits in a single byte):
[s.gsub(/..(..)/,'\1')].pack('H*')
Or use them, treating each 4-digit group as a code point:
s.gsub(/..../) { |p| p.hex.chr('UTF-8') }
If you want to use Encoding::Converter:
ec = Encoding::Converter.new('UTF-16BE','UTF-8') # save converter for reuse
ec.convert( [s].pack('H*') ) # or: ec.convert [s].pack'H*'
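To see where the two approaches differ, here is a minimal sketch using a shortened, made-up sample (not the string from the question) that contains one non-ASCII character. The variable names `sample`, `stripped` and `converted` are only for illustration.

```ruby
# Hypothetical sample: "Hi €" as UTF-16BE hex (U+20AC is the euro sign).
sample = "00480069002020AC"

# Dropping the first byte of every pair keeps only the low byte,
# so any code point above U+00FF is mangled.
stripped = [sample.gsub(/..(..)/, '\1')].pack('H*')

# Decoding as UTF-16BE keeps both bytes of every code unit.
ec = Encoding::Converter.new('UTF-16BE', 'UTF-8')
converted = ec.convert([sample].pack('H*'))

p stripped.bytes      # the euro sign has collapsed to the single byte 0xAC
puts converted        # the euro sign survives
puts converted.encoding
```

For a plain-ASCII message like the one in the question both give readable text, but only the converter is safe once other characters appear.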
Upvotes: 1
Reputation: 79733
The extra 00s suggest that the string is actually the hex representation of a UTF-16 string, rather than UTF-8. Assuming that is the case, there are three steps to get a UTF-8 string: first, convert the string into the actual bytes the hex digits represent (Array#pack can be used for this); second, mark it as being in the appropriate encoding with force_encoding (which looks like UTF-16BE); and finally, use encode to convert it to UTF-8:
[s].pack('H*').force_encoding('utf-16be').encode('utf-8')
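The same three steps can be written out one at a time, which makes it easier to inspect each intermediate value. This is only a sketch on a shortened sample (the first four characters of the message); the variable names are illustrative.

```ruby
hex = "0059006F00750072"   # shortened sample: "Your" as UTF-16BE hex

# 1. Hex digits -> raw bytes. The result is tagged ASCII-8BIT (binary).
bytes = [hex].pack('H*')

# 2. Relabel the bytes as UTF-16BE. force_encoding changes only the tag
#    (in place, on the same string object); no bytes are modified.
utf16 = bytes.force_encoding('UTF-16BE')

# 3. Transcode to UTF-8. This produces a new string with new bytes.
utf8 = utf16.encode('UTF-8')

puts utf8
puts utf8.encoding
puts utf8.bytesize   # 4 bytes in UTF-8 versus 8 in UTF-16BE
```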
Upvotes: 5
Reputation: 19221
I think there are extra null characters all along the string (it's valid, but wasteful), but you can try:
[s].pack('H*').force_encoding('utf-8')
Although, it seems the message reads "Your credit has gone below 5 dollars"...
The string prints with puts, but I can't read all the Unicode characters on the terminal when the string is dumped.
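A short sketch of why the dumped string looks odd: relabeling the bytes as UTF-8 does not remove the 00 bytes, it just turns each one into a NUL character. The sample below is a shortened stand-in for the original string, and `relabeled` is an illustrative name.

```ruby
hex = "0059006F00750072"   # shortened sample: "Your" as UTF-16BE hex

# Only the encoding tag changes; every 00 byte is still in the string.
relabeled = [hex].pack('H*').force_encoding('UTF-8')

p relabeled.valid_encoding?   # NUL is a legal UTF-8 character
p relabeled.length            # twice the visible character count
p relabeled                   # inspect shows the embedded NULs
p relabeled.delete("\0")      # stripping NULs works only for ASCII-range text
```

Terminals usually ignore NUL, which is likely why puts appears to work while the dumped form shows extra characters.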
Upvotes: 2