Reputation: 2564

Search specific non-ASCII chars in Unix

Is it possible to search for the set of non-ASCII chars ï¿½ in a file in Unix?

I want to search for all of these characters in bash and replace each of them with two spaces.

sed -i 's/[ï¿½]/\ \ /g' filename finally worked.

Upvotes: 0

Answers (2)

user4815162342

Reputation: 154921

The way to search for those chars will depend on their encoding in the file. If the file is in the UTF-8 encoding, you can set the UTF-8 locale and simply match them from the shell. Assuming GNU sed (the default on Linux), the command line will look like this:

LANG=C.UTF-8 sed -i 's/[ï¿½]/  /g' filename

For this to work, you must be in a UTF-8-compliant shell, so that e.g. echo 'ï' | wc -c outputs 3 (two UTF-8 code units plus newline).
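If no UTF-8 locale is available, a byte-level match is one possible fallback. The sketch below assumes GNU sed (for the \xHH and \| extensions) and assumes the file really is UTF-8; the file name demo.txt is made up for illustration. It prints the result instead of editing in place; add -i once the output looks right.

```shell
# Hypothetical sample file; the name demo.txt is only for illustration.
workdir=$(mktemp -d)
printf 'a\303\257b\302\277c\302\275d\n' > "$workdir/demo.txt"

# In the C locale sed works on single bytes, so each character is spelled
# out as its UTF-8 byte sequence: ï = C3 AF, ¿ = C2 BF, ½ = C2 BD.
result=$(LC_ALL=C sed 's/\xc3\xaf\|\xc2\xbf\|\xc2\xbd/  /g' "$workdir/demo.txt")

echo "$result"
```

Each of the three characters is replaced by two spaces, regardless of the locale the shell itself runs in.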

Upvotes: 1

tripleee

Reputation: 189387

You seem to be looking at UTF-8 data using a Latin-1 tool. Hence, your question is basically ill-defined, but assuming you want to find files containing the Unicode replacement character U+FFFD (encoded as UTF-8), try something like

perl -CSD -nle 'if (m/\x{FFFD}/) { print $ARGV; close ARGV }' files ...

Here's what I used to understand your question:

$ echo -n 'ï¿½' | iconv -t iso-8859-1 | xxd
0000000: efbf bd

Googling for efbfbd quickly brought up http://www.fileformat.info/info/unicode/char/0fffd/index.htm among the top hits.
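The round trip can also be run in the other direction to confirm the diagnosis. This is a minimal sketch, assuming an iconv that knows iso-8859-1 and utf-8: it takes the three bytes of U+FFFD and reinterprets them as Latin-1, which should reproduce the three characters from the question.

```shell
# Bytes EF BF BD are U+FFFD in UTF-8 (written here as octal escapes).
# Reading them as ISO-8859-1 and re-encoding as UTF-8 yields ï, ¿ and ½.
mojibake=$(printf '\357\277\275' | iconv -f iso-8859-1 -t utf-8)
echo "$mojibake"

# Hex dump of the result: c3af (ï), c2bf (¿), c2bd (½).
hex=$(printf '%s' "$mojibake" | od -An -tx1 | tr -d ' \n')
echo "$hex"
```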

Note also that U+FFFD is basically an error code. You should probably not find and replace it. You should find out which previous encoding step failed and produced this, and fix that instead.
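To trace where the damage occurred, it may help to list the affected lines first instead of replacing anything. A rough sketch, assuming a grep that supports -a (GNU and BSD grep do); the file name sample.txt is hypothetical:

```shell
# Hypothetical sample file with one damaged line.
scratch=$(mktemp -d)
printf 'clean line\nbroken \357\277\275 line\n' > "$scratch/sample.txt"

# The UTF-8 byte sequence of U+FFFD (EF BF BD), built with octal escapes.
pattern=$(printf '\357\277\275')

# -F: fixed string, -n: line numbers, -a: treat the file as text.
# LC_ALL=C makes grep compare raw bytes rather than decoded characters.
hits=$(LC_ALL=C grep -a -n -F -- "$pattern" "$scratch/sample.txt" | cut -d: -f1)

echo "$hits"
```

The line numbers point at the records that went through the failed conversion, which is usually a better starting point than blanking the characters out.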

Upvotes: 1
