Reputation: 5946
I have a tr command that is supposed to transliterate special characters into standard [a-z][A-Z] characters, because I'm tidying up input for something that can't accept special characters such as ÊÌÐÑÖØÙÜÝßàåæçèîïðõ.
However, it's not working as expected for my test input.
Command (the input, Bío-Bío, is in the echo command):
echo "Bío-Bío" | tr [ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ] [SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy]
Actual Output:
B]]o-B]]o
Desired Output:
Bio-Bio
Can anyone give me any insight into why this is happening?
EDIT0:
I have checked and confirmed both strings in the tr command are the same length (69 characters).
Upvotes: 1
Views: 293
Reputation: 12176
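The tr command does not understand UTF-8: many implementations, including GNU tr, work on single bytes, so a multibyte character is not treated as one unit.
You might have better luck using iconv: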
echo "Bío-Bío" | iconv -f utf8 -t ascii//translit
Upvotes: 3
Reputation: 81724
The two argument strings effectively differ in length because the characters in the first argument are encoded in multiple bytes in UTF-8. I copied the above and pasted it into a script on my MacBook, then ran od -t x2 on it, and got the following:
0000000 6365 6f68 2220 c342 6fad 422d adc3 226f
0000020 7c20 7420 2072 c55b c5a0 c592 c5bd c5a1
0000040 c593 c5be c2b8 c2a5 c3b5 c380 c381 c382
0000060 c383 c384 c385 c386 c387 c388 c389 c38a
0000100 c38b c38c c38d c38e c38f c390 c391 c392
0000120 c393 c394 c395 c396 c398 c399 c39a c39b
0000140 c39c c39d c39f c3a0 c3a1 c3a2 c3a3 c3a4
0000160 c3a5 c3a6 c3a7 c3a8 c3a9 c3aa c3ab c3ac
0000200 c3ad c3ae c3af c3b0 c3b1 c3b2 c3b3 c3b4
0000220 c3b5 c3b6 c3b8 c3b9 c3ba c3bb c3bc c3bd
0000240 5dbf 5b20 4f53 735a 7a6f 5959 4175 4141
0000260 4141 4141 4543 4545 4945 4949 4449 4f4e
0000300 4f4f 4f4f 554f 5555 5955 6173 6161 6161
0000320 6161 6563 6565 6965 6969 6f69 6f6e 6f6f
0000340 6f6f 756f 7575 7975 5d79 000a
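See all those 0xc3 bytes? Those are the lead bytes of two-byte UTF-8 sequences. od -t x2 prints the file as 16-bit words, so each accented character in the first argument shows up as two bytes rather than one.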
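To see the size difference without reading a hex dump, a byte count is enough. This is only a small sketch: byte_count is a hypothetical helper name, not something from the question, and it assumes the script is saved as UTF-8.

```shell
# Hypothetical helper: print the number of bytes in its argument.
# wc -c counts bytes, not characters; tr -d ' ' strips the padding
# that some wc implementations put in front of the number.
byte_count() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'i'   # plain ASCII letter: 1 byte
byte_count 'í'   # i with acute accent, 0xc3 0xad in UTF-8: 2 bytes
```

So a set that looks like 69 characters on screen can be considerably longer than 69 bytes, which is what a byte-oriented tr actually sees.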
As to how to fix: not sure. I wonder if using three-digit octal escapes (\nnn) to represent the strange characters would help.
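One possible workaround, offered as a sketch rather than a tested replacement for the whole 69-character table: use one sed substitution per character instead of tr. A substitution matches the full byte sequence of the accented letter, so it does not depend on one-byte-per-character mapping. The helper name to_ascii and the short list of letters are assumptions for illustration, and the script is assumed to be saved as UTF-8.

```shell
# Hypothetical helper: replace a few accented letters with ASCII ones.
# Only í, Í, é and ñ are handled here; add one -e expression per
# additional character that needs mapping.
to_ascii() {
  printf '%s\n' "$1" | sed -e 's/í/i/g' -e 's/Í/I/g' -e 's/é/e/g' -e 's/ñ/n/g'
}

to_ascii 'Bío-Bío'
```

This is more verbose than a tr set, but each mapping is explicit, and a one-to-many mapping (for example ß to ss) would fit the same pattern.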
Upvotes: 3