Reputation: 105
I have a UTF-8 encoded text file containing a list of names. I'm trying to make separate files for the consonants and the vowels, and I managed with a simple tr -d list-of-vowels, but for some reason the resulting file has ç replaced with xA7, which then gets replaced by § when I run the file through a sed script (and messes up the script in the process, hence the issue). All the characters in the file (I've converted it to all lower case for ease of analysis):
bcdfghjklmnpqrstvwxzçðñ àáâãæèéêëìíîïòóôõøúüýaeiouyäåö '*,-./`#
For some reason only ç causes problems. The sed command I'm using to calculate the number of each character by year in the file is
sed -E -e 's/"([^"]*)","([^"]*)",.*/\L\2,\1/' -e 's/^([^,]+),(.)(.+)$/\1,\2\n\1,\3/; P; D'
but I don't think there should be a problem with it.
The file being processed is a .csv file formatted like this:
"hanna","1919","2"
"hanna","1919","2"
"heidi","1919","2"
"heidi","1919","2"
"anja","1938","2"
"anja","1938","2"
"eila","1947","2"
"eila","1947","2"
Ordered first by year and then alphabetically.
Any clue why tr is doing that and how to make it stop? I even tried running sed -i "s/\ァ/\ç/g", but it didn't actually do anything, even though ァ is how e.g. cat displays the broken character.
Upvotes: 0
Views: 427
Reputation: 123570
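The current version of tr from GNU coreutils (8.29) does not support UTF-8: it processes its input byte by byte, so a multibyte character such as ç is handled as separate bytes rather than as one character.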
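The effect can be reproduced in a few lines. This is only a minimal sketch, assuming GNU tr and a script saved as UTF-8; hex_bytes is a helper name made up for illustration. In UTF-8, ç is the two bytes c3 a7 and à is c3 a0, so deleting à removes every c3 and every a0 byte, and a lone a7 is left over from ç.

```shell
#!/bin/sh
# hex_bytes is a hypothetical helper: print stdin as hex bytes without spaces.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Show how ç is encoded in UTF-8 (two bytes).
printf 'ç' | hex_bytes; echo

# A byte-oriented tr sees 'à' as two separate bytes and deletes each of them
# wherever it occurs, so only the second byte of ç is left in the output.
printf 'ç' | tr -d 'à' | hex_bytes; echo
```

The leftover a7 byte is not valid UTF-8 on its own, which is presumably why later tools display it as something else.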
One bug report suggests this is on the roadmap for version 9.
In the meantime, use sed.
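As a rough sketch of the sed route: the function below (strip_vowels, a name invented here) deletes vowels by listing each one as a literal alternative instead of using a bracket expression. That choice is an assumption made for robustness: a literal multibyte sequence should match the same way in a UTF-8 locale and in the C locale, whereas a bracket expression such as [àá] is only character-aware in a UTF-8 locale. The vowel list is copied from the character list in the question and may need adjusting.

```shell
#!/bin/sh
# strip_vowels is a hypothetical helper: delete vowels from stdin.
# Each vowel is a literal alternative, so a multibyte vowel is matched as a
# whole sequence and consonants such as ç are left untouched.
strip_vowels() {
    sed -E 's/(a|e|i|o|u|y|ä|å|ö|à|á|â|ã|æ|è|é|ê|ë|ì|í|î|ï|ò|ó|ô|õ|ø|ú|ü|ý)//g'
}

# Example input containing ç and å.
printf 'françois\nåke\n' | strip_vowels
```

Swapping the alternatives for the consonant list would give the vowel-only counterpart in the same way.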
Upvotes: 2