chrisbunney

Reputation: 5946

Why am I getting unexpected ] characters in this tr command?

I have a tr command that is supposed to transliterate special characters into standard [a-z][A-Z] characters, because I'm tidying up input for something that can't accept special characters such as ÊÌÐÑÖØÙÜÝßàåæçèîïðõ.

However, it's not working as expected for my test input.

Command (the input, Bío-Bío, is supplied by the echo):

echo "Bío-Bío" | tr [ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ] [SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy]

Actual Output:

B]]o-B]]o

Desired Output:

Bio-Bio

Can anyone give me any insight into why this is happening?

EDIT0: I have checked and confirmed both strings in the tr command are the same length (69 characters).

Upvotes: 1

Views: 293

Answers (2)

jpa

Reputation: 12176

The tr command does not understand UTF-8: many implementations, GNU tr among them, translate single bytes rather than multibyte characters.

You might have better luck using iconv:

echo "Bío-Bío" | iconv -f utf8 -t ascii//translit

Upvotes: 3

Ernest Friedman-Hill

Reputation: 81724

The two argument strings effectively differ in length because the characters in the first argument are encoded in multiple bytes in UTF-8. I copied the above and pasted into a script on my MacBook, then ran od -t x2 on it, and got the following:

0000000      6365    6f68    2220    c342    6fad    422d    adc3    226f
0000020      7c20    7420    2072    c55b    c5a0    c592    c5bd    c5a1
0000040      c593    c5be    c2b8    c2a5    c3b5    c380    c381    c382
0000060      c383    c384    c385    c386    c387    c388    c389    c38a
0000100      c38b    c38c    c38d    c38e    c38f    c390    c391    c392
0000120      c393    c394    c395    c396    c398    c399    c39a    c39b
0000140      c39c    c39d    c39f    c3a0    c3a1    c3a2    c3a3    c3a4
0000160      c3a5    c3a6    c3a7    c3a8    c3a9    c3aa    c3ab    c3ac
0000200      c3ad    c3ae    c3af    c3b0    c3b1    c3b2    c3b3    c3b4
0000220      c3b5    c3b6    c3b8    c3b9    c3ba    c3bb    c3bc    c3bd
0000240      5dbf    5b20    4f53    735a    7a6f    5959    4175    4141
0000260      4141    4141    4543    4545    4945    4949    4449    4f4e
0000300      4f4f    4f4f    554f    5555    5955    6173    6161    6161
0000320      6161    6563    6565    6965    6969    6f69    6f6e    6f6f
0000340      6f6f    756f    7575    7975    5d79    000a                

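See all those 0xc3 bytes? Those are the lead bytes of two-byte UTF-8 sequences; od -t x2 groups the bytes into 16-bit words, which is why each one shows up paired with a neighbouring byte. A byte-oriented tr therefore sees a first set roughly twice as long as the second. GNU tr, for example, pads the shorter second set by repeating its last character, which here is the literal ], so both bytes of í would end up mapped to ], matching the ]] in the output.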
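To make the byte-level picture concrete, here is a small sketch using a reduced version of the command. It assumes a tr that works on bytes under LC_ALL=C and pads the second set with its last character, as GNU tr documents. The variable names are only for illustration.

```shell
# "í" (U+00ED) written as its two UTF-8 bytes via octal escapes.
i_acute=$(printf '\303\255')

# One character, but two bytes.
nbytes=$(printf '%s' "$i_acute" | wc -c | tr -d ' ')
echo "bytes in í: $nbytes"

# One byte per column, so no 16-bit word grouping to untangle.
printf '%s' "$i_acute" | od -An -t x1

# set1 is 4 bytes: "[", 0xc3, 0xad, "]"; set2 is 3 bytes: "[", "i", "]".
# The brackets are literal here, and set2 is padded with its last byte, "]".
result=$(printf 'B%so\n' "$i_acute" | LC_ALL=C tr "[$i_acute]" '[i]')
echo "$result"   # with GNU tr: Bi]o
```

In this reduced case 0xc3 maps to i and 0xad falls onto the padding, so a single stray ] appears. In the full command both bytes land in the padded region, which would explain the doubled ]].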

As to how to fix: not sure. I wonder if using three-digit octal escapes (\nnn) to represent the strange characters would help.
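Octal escapes on their own would probably not change the outcome, since a byte-oriented tr would still translate each byte separately. They can, however, be used to build the two-byte sequence and replace it as a unit with sed. The following is only a sketch for the single character í; strip_i_acute is a hypothetical helper name, and a real solution would need one substitution per character.

```shell
# Hypothetical helper: replaces the whole two-byte sequence for "í" with "i".
strip_i_acute() {
  i_acute=$(printf '\303\255')   # UTF-8 bytes of "í", built from octal escapes
  # LC_ALL=C makes sed treat the input as plain bytes.
  LC_ALL=C sed "s/$i_acute/i/g"
}

# "Bío-Bío" spelled out byte by byte so the example does not depend on
# the encoding of this file.
printf 'B\303\255o-B\303\255o\n' | strip_i_acute   # prints Bio-Bio
```

Matching the full sequence avoids the ambiguity of mapping continuation bytes individually: 0xa0, for instance, follows 0xc5 in Š but 0xc3 in à.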

Upvotes: 3
