Reputation: 25
I'm attempting to do some basic text analysis from the command line, but whenever I try to run a command, I get the following: tr: Illegal byte sequence. I've narrowed the problem down to the special characters within the text (´, ˆ, ¨, etc.). Is there something I can do to remove these special characters from the text? Can I use the command line, or do I have to run a script?
Upvotes: 2
Views: 8549
Reputation: 511
I don't know how you are processing your text, but apparently you are running tr, which gives you the error message tr: Illegal byte sequence. This happens when the input is not valid in the encoding tr expects, which on Mac OS X is normally UTF-8 (not every byte sequence is the UTF-8 encoding of a series of Unicode characters).
I do not know what kind of file you are trying to process, but on Mac OS X the command file -I, followed by the file name, might give you an idea of the encoding that is actually there.
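As a rough illustration, here is how asking file for the encoding might look on a small sample. The file name and contents are made up for the demo, and the long option --mime-encoding is used instead of -I because, as far as I know, it is spelled the same on Mac OS X and Linux.

```shell
# Hypothetical sample: "café" in ISO-8859-1, where é is the single byte 0xE9 (octal 351).
workdir=$(mktemp -d)
printf 'caf\351\n' > "$workdir/latin1.txt"

# -b prints only the result, without the file name in front.
encoding=$(file -b --mime-encoding "$workdir/latin1.txt")
echo "detected encoding: $encoding"
```

file only guesses from the bytes it sees, so treat the answer as a hint rather than a guarantee.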
If it is just a matter of recoding your file, then iconv is a useful program. You can use it to recode to UTF-8 with iconv -f ... -t utf8, where ... is the encoding of your original file (run iconv -l for a list of the available encodings).
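A minimal sketch of such a recoding, assuming the original file is ISO-8859-1. That source encoding, the sample text, and the file names are assumptions for the demo; substitute whatever your own file actually uses.

```shell
# Hypothetical sample: "café" in ISO-8859-1 (é is the single byte 0xE9, octal 351).
# On its own this byte is not valid UTF-8.
workdir=$(mktemp -d)
printf 'caf\351\n' > "$workdir/latin1.txt"

# Recode from the assumed source encoding to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

# Dump the result as hex bytes; é should now be the two-byte sequence c3 a9.
od -An -tx1 "$workdir/utf8.txt"
```

iconv writes to standard output, so the result is redirected to a new file rather than overwriting the original.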
Or, if you really want to remove the special characters from your file (as you state in the title of your question), you can use iconv -f ... -t ascii//TRANSLIT. In this case, the "special characters" will be approximated by normal ASCII characters.
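A small sketch of the transliteration, assuming the input is already valid UTF-8 (the sample text and file names are made up). How each character gets approximated depends on the iconv implementation and the current locale, so the exact output is not shown here.

```shell
# Hypothetical sample: "café naïve" in UTF-8 (é is 0xC3 0xA9, ï is 0xC3 0xAF).
workdir=$(mktemp -d)
printf 'caf\303\251 na\303\257ve\n' > "$workdir/utf8.txt"

# Approximate every non-ASCII character with ASCII.
# Some iconv versions exit non-zero when a character has no good approximation.
if ! iconv -f UTF-8 -t ASCII//TRANSLIT "$workdir/utf8.txt" > "$workdir/ascii.txt"; then
    echo "iconv could not approximate every character" >&2
fi

cat "$workdir/ascii.txt"
```

Once the text is plain ASCII, tr should no longer complain about illegal byte sequences.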
Upvotes: 1