Juicy

Reputation: 12530

Decoding strings of UTF-16 encoded hex characters

I have a file that contains hex digits as clear text (except for the first 18 bytes), but the file itself is encoded as UTF-16. Here's a short dump of the file:

00000000  ff fe 35 1f d3 bb 7a ef  df 45 92 df be ff 33 c2  |..5...z..E....3.|
00000010  af c7 30 00 42 00 45 00  33 00 45 00 35 00 45 00  |..0.B.E.3.E.5.E.|
00000020  35 00 44 00 35 00 44 00  41 00 36 00 44 00 38 00  |5.D.5.D.A.6.D.8.|
00000030  42 00 41 00 30 00 37 00  39 00 42 00 46 00 34 00  |B.A.0.7.9.B.F.4.|
00000040  46 00 31 00 45 00 41 00  36 00 37 00 32 00 34 00  |F.1.E.A.6.7.2.4.|
00000050  42 00 39 00 43 00 42 00  41 00 42 00 45 00 44 00  |B.9.C.B.A.B.E.D.| 
...

I would like to read this file line by line (it has \r\n line breaks) and decode the hex data in each line. If this were an ASCII string I could just do this:

a_line = '00112233445566778899'
hex_data = a_line.decode('hex')

But because it's UTF-16 I get a Non-hexadecimal digit error when trying this approach.

My question is: how can I load a string of UTF-16 encoded hex characters as hex data?

Upvotes: 2

Views: 788

Answers (1)

jcoppens

Reputation: 5440

00000000  ff fe 35 1f d3 bb 7a ef  df 45 92 df be ff 33 c2  |..5...z..E....3.|
00000010  af c7 30 00 42 00 45 00  33 00 45 00 35 00 45 00  |..0.B.E.3.E.5.E.|

The first line starts with bytes that are not hex characters: after the ff fe byte order mark come 35 1f d3 bb 7a ef ... af c7. So beware when decoding: the file is not pure hex.
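One way to deal with that prefix is to read the file as raw bytes and cut it off before decoding. The sketch below assumes the layout shown in the dump (a 2-byte little-endian BOM followed by 16 non-hex bytes); `split_prefix`, `BOM_LE` and `PREFIX_LEN` are hypothetical names, not anything from the question. Note that, read as little-endian 16-bit units, the prefix bytes 92 df form U+DF92, an unpaired surrogate, so a strict UTF-16 decoder may well reject the first line if the prefix is left in place.

```python
import binascii

BOM_LE = b"\xff\xfe"
PREFIX_LEN = 18  # assumption from the dump: 2-byte BOM + 16 non-hex bytes


def split_prefix(raw):
    """Return (prefix_bytes, text): the 16 non-hex bytes and the decoded rest."""
    if not raw.startswith(BOM_LE):
        raise ValueError("expected a UTF-16 little-endian BOM")
    prefix = raw[len(BOM_LE):PREFIX_LEN]
    # Decode only the part after the prefix; "utf-16-le" expects no BOM.
    text = raw[PREFIX_LEN:].decode("utf-16-le")
    return prefix, text


# First 26 bytes of the dump in the question, rebuilt for illustration.
sample = binascii.unhexlify(
    "fffe351fd3bb7aefdf4592dfbeff33c2afc7"  # BOM + 16 prefix bytes
    "3000420045003300"                      # "0BE3" in UTF-16-LE
)
prefix, text = split_prefix(sample)
print(text)
```

For a real file, `raw` would come from something like `open(path, "rb").read()`; the returned text can then be split into lines as usual.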

You can read this file using the io module, where you can explicitly declare the file coding:

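import io

def main(args):
    # Assumption: the path to the file is the first command-line argument.
    testfile = args[1]

    # Declaring the encoding makes io.open decode UTF-16 to text while reading.
    with io.open(testfile, "r", encoding = 'utf-16') as inf:
        lines = inf.readlines()

    for line in lines:
        print(line)

    return 0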

Newlines should be detected automatically, but you can set them explicitly by passing an extra parameter to io.open (newline = "\r\n").
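To see what the newline parameter changes, here is a small sketch that writes a hypothetical two-line sample (not the real file from the question) to a temporary file and reads it back both ways. With the default, line endings are translated to "\n"; with newline="\r\n", lines end only at "\r\n" and the ending is returned untranslated.

```python
import io
import os
import tempfile

# Hypothetical two-line sample, written as UTF-16 with \r\n line breaks.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, "wb") as out:
        out.write(u"0BE3\r\nE5E5\r\n".encode("utf-16"))

    # Default (newline=None): line endings are translated to "\n".
    with io.open(path, "r", encoding="utf-16") as inf:
        translated = inf.readlines()

    # newline="\r\n": lines end only at "\r\n", returned untranslated.
    with io.open(path, "r", encoding="utf-16", newline="\r\n") as inf:
        untranslated = inf.readlines()
finally:
    os.remove(path)

print(repr(translated))
print(repr(untranslated))
```

Either way the line ending stays attached to each line, so it still has to be stripped before the hex digits are decoded.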

Once read, strip the trailing line break from each line (it is not a hex digit) and you should be able to .decode('hex') normally.
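As a rough sketch of that last step, the function below takes text that has already been decoded from UTF-16 and converts each line to raw bytes. It uses binascii.unhexlify rather than .decode('hex') so that it also runs on Python 3, where the 'hex' string codec is not available; `decode_hex_lines` is a hypothetical helper name, and the sample text and its line breaks are made up for illustration.

```python
import binascii


def decode_hex_lines(text):
    """Convert text with one hex string per line into a list of byte strings."""
    decoded = []
    for line in text.splitlines():
        line = line.strip()  # drop any leftover \r, \n or spaces
        if not line:
            continue  # skip blank lines
        # Hex digits are plain ASCII, so encoding to ASCII is safe here.
        decoded.append(binascii.unhexlify(line.encode("ascii")))
    return decoded


# Hypothetical sample; the position of the line breaks is an assumption.
sample_text = u"0BE3E5E5\r\nD5DA6D8B\r\n"
result = decode_hex_lines(sample_text)
print(result)
```

A line containing anything other than an even number of hex digits makes unhexlify raise an error, which is a useful signal that the non-hex prefix was not removed first.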

Upvotes: 1
