Mayank Jain

Reputation: 2564

Search specific non-ASCII chars in Unix

Is it possible to search for the set of non-ASCII chars ï¿½ in a file in Unix?

I want to search all these characters in bash to replace them with two spaces.

sed -i 's/[ï¿½]/\ \ /g' filename finally worked.

Upvotes: 0

Views: 202

Answers (2)

user4815162342

Reputation: 154921

The way to search for those chars will depend on their encoding in the file. If the file is in the UTF-8 encoding, you can set the UTF-8 locale and simply match them from the shell. Assuming GNU sed (the default on Linux), the command line will look like this:

LANG=C.UTF-8 sed -i 's/[ï¿½]/  /g' filename

For this to work, you must be in a UTF-8-compliant shell, so that e.g. echo 'ï' | wc -c outputs 3 (two UTF-8 code units plus newline).
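Since the right approach depends on the file's encoding, it can help to check first whether the file really is valid UTF-8. A minimal sketch, assuming an `iconv` that exits non-zero on undecodable input (glibc's does); `is_utf8` is a hypothetical helper name, not something from the question:

```shell
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# iconv re-encodes UTF-8 to UTF-8 and fails on the first invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

You could then guard the sed call with it, e.g. `is_utf8 filename && LANG=C.UTF-8 sed -i ... filename`. If the check fails, the file is probably in some other encoding (Latin-1, for instance) and the UTF-8 locale will not match the characters the way you expect.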

Upvotes: 1

tripleee

Reputation: 189387

You seem to be looking at UTF-8 data using a Latin-1 tool. Hence, your question is basically ill-defined, but assuming you want to find files containing the Unicode replacement character (U+FFFD) encoded as UTF-8, try something like

perl -CSD -nle 'if (m/\x{FFFD}/) { print $ARGV; close ARGV }' files ...

Here's what I used to understand your question:

$ echo -n 'ï¿½' | iconv -t iso-8859-1 | xxd
0000000: efbf bd                          

Googling for efbfbd quickly brought up http://www.fileformat.info/info/unicode/char/0fffd/index.htm among the top hits.

Note also that U+FFFD is basically an error code. You should probably not find and replace it. You should find out which previous encoding step failed and produced this, and fix that instead.
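If you still want the replacement despite that caveat, the byte sequence ef bf bd found with xxd can be matched directly, without depending on a UTF-8 locale. A rough sketch, not tested against your data; `fffd` and `replace_fffd` are made-up names for illustration:

```shell
# The three bytes ef bf bd (UTF-8 encoding of U+FFFD), written as octal escapes.
fffd=$(printf '\357\277\275')

# Hypothetical helper: replace each U+FFFD on stdin with two spaces.
# LC_ALL=C makes sed treat the input as plain bytes, so the literal
# three-byte sequence matches regardless of the terminal's encoding.
replace_fffd() {
    LC_ALL=C sed "s/$fffd/  /g"
}
```

Used as `replace_fffd < filename > filename.new`. Because this matches the whole three-byte sequence rather than a bracket expression, it only touches the replacement character and leaves other non-ASCII text alone.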

Upvotes: 1
