Gilalar

Reputation: 105

tr changes file encoding?

I have a UTF-8 encoded text file containing a list of names. I'm trying to make separate files for the consonants and the vowels, and I managed with a simple tr -d list-of-vowels, but for some reason ç is replaced with xA7 in the resulting file. That in turn becomes § when I run the file through a sed script (and messes up the script in the process, hence the issue). All the characters in the file (I've converted it to lower case for ease of analysis):
bcdfghjklmnpqrstvwxzçðñ àáâãæèéêëìíîïòóôõøúüýaeiouyäåö '*,-./`#

For some reason only ç causes problems. The sed command I'm using to count each character by year in the file is sed -E -e 's/"([^"]*)","([^"]*)",.*/\L\2,\1/' -e 's/^([^,]+),(.)(.+)$/\1,\2\n\1,\3/; P; D', but I don't think the problem is there.

The file being processed is a .csv file formatted like this:

"hanna","1919","2"  
"hanna","1919","2"  
"heidi","1919","2"  
"heidi","1919","2"  
"anja","1938","2"  
"anja","1938","2"  
"eila","1947","2"  
"eila","1947","2"  

It is ordered first by year and then alphabetically.

Any idea why tr is doing that and how to make it stop? I even tried running sed -i "s/\ァ/\ç/g", but it didn't do anything, even though ァ is how cat, for example, displays the broken character.

Upvotes: 0

Views: 427

Answers (1)

that other guy

Reputation: 123570

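The current version of tr from GNU coreutils (8.29) does not support UTF-8: it operates on individual bytes, not characters. The accented vowels in your delete list are all two-byte sequences starting with 0xC3, so tr deletes every 0xC3 byte, and ç (0xC3 0xA7) is left as a lone 0xA7.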
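
A quick way to check the byte-level behavior yourself, as a rough sketch. The helper name dump_after_tr is made up for illustration, and LC_ALL=C is an assumption added only so that any tr implementation handles the input byte by byte; GNU tr 8.29 does so regardless of locale.

```shell
# Hypothetical helper: delete three accented vowels with tr,
# then show the bytes that survive as hex.
dump_after_tr() {
  # LC_ALL=C is an assumption to force byte-wise handling everywhere.
  LC_ALL=C tr -d 'àáâ' | od -An -tx1
}

# ç is C3 A7 in UTF-8; à, á and â are C3 A0, C3 A1 and C3 A2.
# Byte-wise deletion removes every C3, A0, A1 and A2 byte,
# so the C3 half of ç disappears and a lone A7 remains.
printf 'ç\n' | od -An -tx1      # bytes before tr
printf 'ç\n' | dump_after_tr    # bytes after tr
```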

One bug report suggests this is on the roadmap for version 9.

In the meantime, use sed.
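
A minimal sketch of the sed route, assuming the vowel list from the question; strip_vowels and the file names in the usage comment are hypothetical. It uses an alternation rather than a bracket expression, so each accented vowel is matched as a complete byte sequence even if sed is not running in a UTF-8 locale.

```shell
# Hypothetical helper: delete vowels from stdin without splitting
# multibyte characters such as ç.
strip_vowels() {
  sed -E 's/(a|e|i|o|u|y|à|á|â|ã|ä|å|æ|è|é|ê|ë|ì|í|î|ï|ò|ó|ô|õ|ö|ø|ú|ü|ý)//g'
}

# Usage (file names are placeholders):
#   strip_vowels < names.txt > consonants.txt
printf 'françois\n' | strip_vowels
```

In a UTF-8 locale, GNU sed should also accept the shorter bracket form, sed 's/[aeiouyàáâ...]//g', but that form goes back to matching single bytes if the locale is C or POSIX.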

Upvotes: 2
