Reputation: 10149
I'm on Mac OS X and have a UTF-8 encoded file that contains international (non-ASCII) characters. Is there a tool or simple command (e.g. in Python 2.7 or the shell) that shows the hex (base-16) values of the file's byte stream? For example, if I write some Asian characters into the file, I'd like to see their hex values.
My current solution is to open the file and read it byte by byte as a Python str. Is there a simpler way that doesn't require coding? :)
Edit 1: it seems the output of od is not correct:
cat ~/Downloads/12
1
od ~/Downloads/12
0000000 000061
0000001
Edit 2: I tried the od -t x1 option as well:
od -t x1 ~/Downloads/12
0000000 31
0000001
Thanks in advance, Lin
Upvotes: 0
Views: 3530
Reputation: 168716
od is the right command, but you need to specify an optional argument, -t x1:
$ od -t x1 ~/Downloads/12
0000000 31
0000001
If you prefer not to see the file offsets, try adding -A none:
$ od -A none -t x1 ~/Downloads/12
31
Additionally, the Linux man page (but not the OS X man page) lists this example: od -A x -t x1z -v, "Display hexdump format output."
Reference: http://www.unix.com/man-page/osx/1/od/
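For comparison, here is a rough Python sketch of what od -A none -t x1 prints. The helper names hex_bytes and hex_dump_file are made up for this illustration, and the code is written to run on both Python 2.7 and Python 3. It also shows why the default od output in the question, 000061, was not actually wrong: by default od prints octal words, and octal 061 is the same byte as hex 31, the character "1".

```python
import binascii


def hex_bytes(data):
    """Return the bytes in `data` as space-separated two-digit hex values."""
    digits = binascii.hexlify(data).decode('ascii')
    return ' '.join(digits[i:i + 2] for i in range(0, len(digits), 2))


def hex_dump_file(path):
    """Hypothetical helper: hex-dump a whole file, similar to od -A none -t x1."""
    # Open in binary mode so the bytes are read exactly as stored on disk.
    with open(path, 'rb') as f:
        return hex_bytes(f.read())


# The one-byte file from the question contains the character "1".
print(hex_bytes(b'1'))                      # 31
# od's default output, 000061, is that same byte written in octal.
print(int('061', 8) == 0x31 == ord('1'))    # True
```

Calling hex_dump_file on the file from the question (~/Downloads/12, with the ~ expanded to the full home directory path) should give the same 31 that od -t x1 reports.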
Upvotes: 1
Reputation: 177901
I'm not sure exactly what you want, but this script can help you look up the Unicode codepoint and UTF-8 byte sequence for any character. Be sure to save the source as UTF-8.
# coding: utf8
s = u'我是美国人。'
for c in s:
    # Print the character, its Unicode code point, and its UTF-8 bytes.
    print c,'U+{:04X} {}'.format(ord(c),repr(c.encode('utf8')))
Output:
我 U+6211 '\xe6\x88\x91'
是 U+662F '\xe6\x98\xaf'
美 U+7F8E '\xe7\xbe\x8e'
国 U+56FD '\xe5\x9b\xbd'
人 U+4EBA '\xe4\xba\xba'
。 U+3002 '\xe3\x80\x82'
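If you would rather get plain hex digits than Python's escaped repr, a variant along these lines might work. This is only a sketch: describe is a name invented here, the characters are written as \u escapes so the source stays ASCII, and the code is meant to run on both Python 2.7 and Python 3.

```python
import binascii


def describe(text):
    """Return (character, code point, UTF-8 bytes in hex) for each character."""
    rows = []
    for c in text:
        utf8_hex = binascii.hexlify(c.encode('utf-8')).decode('ascii')
        rows.append((c, 'U+{:04X}'.format(ord(c)), utf8_hex))
    return rows


# u'\u6211\u662f' is the first two characters of the string above.
for _, codepoint, utf8_hex in describe(u'\u6211\u662f'):
    print('{} {}'.format(codepoint, utf8_hex))
# U+6211 e68891
# U+662F e698af
```

The hex digits match the bytes in the output above: e68891 is '\xe6\x88\x91'.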
Upvotes: 3
Reputation: 17366
You can use the iconv command to convert between encodings. The basic command is:
iconv -f from_encoding -t to_encoding inputfile
and you can see a list of supported encodings with
iconv --list
In your case,
iconv -f UTF8 -t UCS-2 inputfile
You've also asked to see the hex values. A standard utility that will do this is xxd. You can pipe the results of iconv to xxd as follows:
iconv -f UTF8 -t UCS-2 inputfile | xxd
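Keep in mind that converting to UCS-2 changes the bytes, so this pipeline shows the UCS-2 code units rather than the UTF-8 bytes stored in the file; running xxd on the file directly shows the UTF-8 bytes. The Python sketch below illustrates the difference for one character. It uses Python's utf-16-be codec as a stand-in for UCS-2 (the two agree for characters in the Basic Multilingual Plane); the byte order iconv picks for plain UCS-2 may differ by platform, and hex_in is just a name invented for this example.

```python
import binascii


def hex_in(text, encoding):
    """Encode `text` with `encoding` and return space-separated hex bytes."""
    digits = binascii.hexlify(text.encode(encoding)).decode('ascii')
    return ' '.join(digits[i:i + 2] for i in range(0, len(digits), 2))


ch = u'\u6211'  # the character U+6211

print(hex_in(ch, 'utf-8'))      # e6 88 91  (three bytes, as stored in a UTF-8 file)
print(hex_in(ch, 'utf-16-be'))  # 62 11     (one 16-bit code unit, big-endian)
```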
Upvotes: 1