Lin Ma

Reputation: 10149

Get the UTF-8 encoded hex values of international characters

I'm on Mac OS X and have a UTF-8 encoded file that contains international (non-ASCII) characters. Is there a tool or simple command (e.g. in Python 2.7 or the shell) that shows the hex (base-16) values of the file's bytes? For example, if I write some Asian characters into the file, I want to see the corresponding hex values.

My current solution is to open the file and read it byte by byte as a Python str. Is there a simpler way that doesn't require writing code? :)

Edit 1: the output of od does not look correct to me:

cat ~/Downloads/12
1

od ~/Downloads/12
0000000    000061
0000001

Edit 2: I also tried od with the -t x1 option:

od -t x1 ~/Downloads/12
0000000    31
0000001

Thanks in advance, Lin

Upvotes: 0

Views: 3530

Answers (3)

Robᵩ

Reputation: 168716

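od is the right command, but by default it prints octal values (000061 is octal for the byte 0x31, the character "1"). To see one hex byte at a time, specify the optional argument -t x1: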

$ od -t x1 ~/Downloads/12
0000000 31
0000001
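
To see why both outputs in the question describe the same byte, here is a small arithmetic check. It is only a sketch, written in Python 3 rather than the Python 2.7 mentioned in the question; the variable names are mine.

```python
# The file holds a single byte: the ASCII character '1'.
byte_value = ord("1")

# od's default format is octal, so "000061" is that byte written in base 8.
from_octal = int("000061", 8)

# od -t x1 prints each byte in hex, so "31" is the same byte in base 16.
from_hex = int("31", 16)

print(byte_value, from_octal, from_hex)  # 49 49 49
```

So the first od output was correct, just in a different base.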

If you prefer not to see the file offsets, try adding -A none:

$ od -A none -t x1 ~/Downloads/12
 31
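
If you do end up scripting it, a rough Python 3 equivalent of od -A none -t x1 could look like the sketch below. The helper names hex_bytes and dump_file are hypothetical, not part of any library, and Python 3 is assumed (iterating over bytes yields integers there, unlike Python 2.7).

```python
def hex_bytes(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join("{:02x}".format(b) for b in data)


def dump_file(path):
    # Binary mode: no decoding happens, so we see the raw bytes on disk.
    with open(path, "rb") as f:
        return hex_bytes(f.read())


print(hex_bytes(b"1"))                  # 31
print(hex_bytes("我".encode("utf-8")))   # e6 88 91
```

Calling dump_file on the file from the question should give the same 31 that od -t x1 reports.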

Additionally, the Linux man page (but not the OS X man page) lists this example: od -A x -t x1z -v, "Display hexdump format output."

Reference: http://www.unix.com/man-page/osx/1/od/

Upvotes: 1

Mark Tolonen

Reputation: 177901

I'm not sure exactly what you want, but this script can help you look up the Unicode codepoint and UTF-8 byte sequence for any character. Be sure to save the source as UTF-8.

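# coding: utf8
# Python 2: the coding line above lets the source contain non-ASCII literals.
s = u'我是美国人。'
for c in s:
    # Print the character, its code point, and the repr of its UTF-8 bytes.
    print c,'U+{:04X} {}'.format(ord(c),repr(c.encode('utf8')))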

Output:

我 U+6211 '\xe6\x88\x91'
是 U+662F '\xe6\x98\xaf'
美 U+7F8E '\xe7\xbe\x8e'
国 U+56FD '\xe5\x9b\xbd'
人 U+4EBA '\xe4\xba\xba'
。 U+3002 '\xe3\x80\x82'
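
The script above is Python 2. For anyone on Python 3, a possible adaptation is sketched below; the describe helper is a name I made up, and the bytes are printed as plain hex pairs instead of a repr, which is a formatting choice of mine rather than the original behavior.

```python
def describe(c):
    """Return (code point label, UTF-8 bytes as hex) for one character."""
    codepoint = "U+{:04X}".format(ord(c))
    utf8_hex = " ".join("{:02X}".format(b) for b in c.encode("utf-8"))
    return codepoint, utf8_hex


s = "我是美国人。"
for c in s:
    codepoint, utf8_hex = describe(c)
    print(c, codepoint, utf8_hex)
# The first line printed is: 我 U+6211 E6 88 91
```

In Python 3 source files are UTF-8 by default, so the coding line is not needed.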

Upvotes: 3

borrible

Reputation: 17366

You can use the command iconv to convert between encodings. The basic command is:

iconv -f from_encoding -t to_encoding inputfile

and you can see a list of supported encodings with

iconv --list

In your case,

iconv -f UTF8 -t UCS-2 inputfile

You've also asked to see the hex values. A standard utility that will do this is xxd. You can pipe the results of iconv to xxd as follows:

iconv -f UTF8 -t UCS-2 inputfile | xxd
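
One thing to keep in mind: after converting to UCS-2, the bytes you see are no longer the UTF-8 bytes stored in the file. The Python 3 sketch below illustrates the difference for a single character. It uses the utf-16-be codec as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane), and big-endian order is an assumption on my part, since the byte order iconv emits for UCS-2 can vary by platform. The hex_bytes helper is a hypothetical name.

```python
def hex_bytes(data):
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join("{:02x}".format(b) for b in data)


ch = "我"  # U+6211

# What is actually stored in a UTF-8 file.
utf8_hex = hex_bytes(ch.encode("utf-8"))

# What a conversion to big-endian UCS-2/UTF-16 would produce (assumed byte order).
ucs2_hex = hex_bytes(ch.encode("utf-16-be"))

print(utf8_hex)  # e6 88 91
print(ucs2_hex)  # 62 11
```

If the goal is the UTF-8 byte stream itself, running xxd directly on the file, without the iconv step, shows those bytes.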

Upvotes: 1
