user2567544

Reputation: 597

Searching for non-ASCII characters

I have a file, a.out, which contains a number of lines. Each line is one character only, either the unicode character U+2013 or a lower case letter a-z.

Running the file command on a.out reports: UTF-8 Unicode text.

The locale command reports:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I issue the command grep -P -n "[^\x00-\xFF]" a.out, I would expect only the lines containing U+2013 to be returned, and that is indeed what happens when I run the test under Cygwin. The problem environment, however, is Oracle Linux Server release 6.5, where the same grep command returns no lines at all. Conversely, grep -P -n "[\x00-\xFF]" a.out returns every line.

I realise the grep manual warns that -P "is highly experimental and grep -P may warn of unimplemented features", but no warnings are issued.

Am I missing something?
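For anyone reproducing this, here is a minimal byte-oriented test sketch (the sample file contents are illustrative). In the C locale grep matches raw bytes, so any byte outside the printable ASCII range flags a non-ASCII character without needing -P at all:

```shell
# Build a sample file: two letters plus an en dash (U+2013, whose UTF-8
# encoding is the bytes 0xE2 0x80 0x93, written here in octal).
printf 'a\n\342\200\223\nb\n' > a.out

# In the C locale grep works on bytes; [^ -~] matches any byte outside
# the printable ASCII range 0x20-0x7E, so only line 2 is reported.
LC_ALL=C grep -n '[^ -~]' a.out
```

Because this sidesteps -P entirely, it also works on grep builds compiled without PCRE support.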

Upvotes: 1

Views: 1309

Answers (3)

tchrist

Reputation: 80443

I recommend avoiding dodgy grep -P implementations and using the real thing. This works:

perl -CSD -nle 'print "$.: $_" if /\P{ASCII}/' utfile1 utfile2 utfile3 ...

Where:

  • The -CSD option says that both the standard trio (stdin, stdout, stderr) and disk files should be treated as UTF-8 encoded.

  • The $. represents the current record (line) number.

  • The $_ represents the current line.

  • The \P{ASCII} matches any code point that is not ASCII.

Upvotes: 3

Kent

Reputation: 195249

gawk can help with this problem; here is the awk one-liner:

 awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}
               {for(i=1;i<=NF;i++)if(!($i in ord))print $i}' file

below is a test with gawk:

kent$  cat f
abcd
+ß
s+äö
ö--我
中文

kent$  awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}{for(i=1;i<=NF;i++)if(!($i in ord))print $i}' f
ß
ä
ö
ö
我
中
文
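A small variation of the same ord[] idea, sketched here as an assumption rather than part of the original answer, produces grep -n-style output (line number plus the whole line) instead of the individual offending characters:

```shell
# Sample file with an en dash (U+2013, octal-escaped UTF-8) on line 2.
printf 'abc\n\342\200\223\ndef\n' > f

# Same ord[] table of ASCII characters; any field not found in it marks
# the record, which is printed once with its line number.
awk -v FS="" 'BEGIN{for(i=1;i<128;i++)ord[sprintf("%c",i)]=i}
              {for(i=1;i<=NF;i++)if(!($i in ord)){print NR": "$0;next}}' f
```

This works whether the awk splits FS="" into characters (gawk in a UTF-8 locale) or into bytes, since in either case the first non-ASCII field triggers the print.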

Upvotes: 0

Thomas Dickey

Reputation: 54583

A comment in How Do I grep For all non-ASCII Characters in UNIX gives the answer:

Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want.

That implies that grep does not treat the UTF-8 encoding of U+2013 (the bytes 0xe2, 0x80, 0x93) as a single character outside the given range. Each byte is tested individually, and since every individual byte necessarily falls within \x00-\xFF, the class [^\x00-\xFF] can never match anything.

The GNU grep manual's description of -P does not mention Unicode or UTF-8. Rather, it says "Interpret the pattern as a Perl regular expression." (This does not mean that the result is identical to Perl, only that some of the backslash-escapes are similar.)

Perl itself can be told to use UTF-8 encoding. However, the examples using Perl in Filtering invalid utf8 do not use that feature. Instead, the expressions (like those in the problematic grep) test only the individual bytes, not the complete character.

Upvotes: 0
