Reputation: 349
I am seeing inconsistent output from hexdump and xxd. When I run the following command:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| xxd
it returns the following results:
00000000: c2a4 2dc2 9dc3 bec2 8fc2 9351 5d0d 5f60 ..-........Q]._`
00000010: c28a 5760 44c3 8e4c 61c3 a61e ..W`D..La...
Note the "c2" bytes. This also happens when I run xxd -p.
When I run the same command except with hexdump -C:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump -C
I get the same results, including the "c2" bytes:
00000000 c2 a4 2d c2 9d c3 be c2 8f c2 93 51 5d 0d 5f 60 |..-........Q]._`|
00000010 c2 8a 57 60 44 c3 8e 4c 61 c3 a6 1e |..W`D..La...|
However, when I run hexdump with no arguments:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" \
| sed 's/\(..\)/\1\n/g' \
| awk '/^[a-fA-F0-9]{2}$/ { printf("%c",strtonum("0x" $0)); }' \
| hexdump
I get the following [correct] results:
0000000 a4c2 c22d c39d c2be c28f 5193 0d5d 605f
0000010 8ac2 6057 c344 4c8e c361 1ea6
For the purpose of this script, I'd rather use xxd as opposed to hexdump. Thoughts?
Upvotes: 0
Views: 240
Reputation: 881
Why not use xxd with -r and -p?
echo a42d9dfe8f93515d0d5f608a576044ce4c61e61e | xxd -r -p | xxd
output
0000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
0000010: 4c61 e61e La..
Upvotes: 1
Reputation: 8304
The problem that you observe is due to UTF-8 encoding and little-endianness.
First, note that when you print a character above 0x7F in AWK, such as 0xA4 (CURRENCY SIGN), AWK emits its multi-byte UTF-8 encoding rather than a single byte. For 0xA4 that is the two bytes 0xC2 0xA4 that you see in your output:
$ echo 1 | awk 'BEGIN { printf("%c", 0xA4) }' | hexdump -C
Output:
00000000 c2 a4 |..|
00000002
This holds for any character bigger than 0x7F and it is due to UTF-8 encoding, which is probably the one set in your locale. (Note: some AWK implementations will have different behavior for the above code.)
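To check that the locale is what triggers the extra byte, the same printf can be run with the locale forced to C. This is only a minimal sketch: char_as_byte is a hypothetical helper name, the character code is passed in decimal for portability, and the single-byte result is what I would expect from gawk and mawk; other AWK implementations may behave differently.

```shell
# char_as_byte: hypothetical helper, prints the character with the given
# decimal code (0-255) while the locale is forced to C for the awk call only.
char_as_byte() {
    LC_ALL=C awk -v code="$1" 'BEGIN { printf("%c", code + 0) }'
}

# 164 is 0xA4; od -tx1 dumps the result one byte at a time.
char_as_byte 164 | od -An -tx1   # a4 with gawk or mawk
```

Assuming your awk behaves the same way, prefixing the awk stage of the original pipeline with LC_ALL=C should likewise make it emit one byte per hex pair.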
Secondly, when you use hexdump without the -C argument, it displays each pair of bytes in swapped order due to the little-endianness of your machine. This is because each pair of bytes is then treated as a single 16-bit word, instead of each byte being treated separately, as is done by the xxd and hexdump -C commands. So the xxd output that you get is actually the correct byte-for-byte representation of the input.
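The byte swapping can be seen in isolation by dumping the same two bytes once per byte and once per 16-bit word. The sketch below uses od instead of hexdump only because od is specified by POSIX; -tx1 corresponds to the per-byte view of xxd and hexdump -C, and -tx2 to the default view of hexdump. The helper names dump_bytes and dump_words are made up for illustration.

```shell
# Hypothetical helpers: two views of whatever arrives on standard input.
dump_bytes() { od -An -tx1; }   # one byte at a time, in file order
dump_words() { od -An -tx2; }   # 16-bit words in the host's byte order

# 0xC2 0xA4 written as octal escapes (302 244) so any POSIX printf accepts them.
printf '\302\244' | dump_bytes   # c2 a4
printf '\302\244' | dump_words   # a4c2 on a little-endian machine
```

The data is identical in both cases; only the grouping used for display changes.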
Thirdly, if you want to produce the precise byte string that is encoded in the hexadecimal string that you are feeding to sed, you can use this Python solution:
echo -n "a42d9dfe8f93515d0d5f608a576044ce4c61e61e" | sed 's/\(..\)/0x\1,/g' | python3 -c "import sys;[open('tmp','wb').write(bytearray(eval('[' + line + ']'))) for line in sys.stdin]" && cat tmp | xxd
Output:
00000000: a42d 9dfe 8f93 515d 0d5f 608a 5760 44ce .-....Q]._`.W`D.
00000010: 4c61 e61e La..
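If Python is not available, a pure-shell decoder is another option. This is a minimal sketch and not part of the solution above: hex_to_bytes is a hypothetical helper name, it assumes the argument contains only hex digits, and it relies on printf accepting a 0x-prefixed argument for %o (which bash and dash do).

```shell
# hex_to_bytes: hypothetical helper, writes the raw bytes named by a hex string.
# Assumes $1 contains only hex digits; an odd-length argument is rejected.
hex_to_bytes() {
    hex=$1
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        rest=${hex#??}        # everything after the first two digits
        pair=${hex%"$rest"}   # the first two digits
        hex=$rest
        # Convert the pair to octal, then let %b expand the \0ooo escape
        # into one raw byte, so no character encoding is involved.
        printf '%b' "\\0$(printf '%o' "0x$pair")"
    done
}

hex_to_bytes a42d9dfe8f93515d0d5f608a576044ce4c61e61e | od -An -v -tx1
```

Because each byte is written through an octal escape rather than as a character, no UTF-8 conversion should take place, and the od dump should list the same 20 bytes as the input string.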
Upvotes: 1