User1611

Reputation: 1099

How can I find extended ASCII characters in a file using Perl?

How can I find extended ASCII characters in a file using Perl? Can anyone provide a script?

Thanks in advance.

Upvotes: 7

Views: 8478

Answers (6)

Impress TheNet

Reputation: 11

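What about grep? With GNU grep, -P makes the \x escapes work, and the pattern has to be quoted so the shell leaves it alone:

LC_ALL=C grep -P '[\x00-\x1F\x7F-\xFF]' *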

Upvotes: 1

Reputation:

Hynek -Pichi- Vychodil's answer:

perl -nE'say$.if/[\xE0-\xFF]/'

only tests a limited part of the extended range (0xE0-0xFF); it should presumably be

perl -nE'say$.if/[\x80-\xFF]/'

instead.

Upvotes: 2

Sinan Ünür

Reputation: 118148

A crucial question is whether the

use bytes;

pragma should be in effect. The poster should decide that. For picking characters with codes greater than 127, the following will suffice:

print grep 127 < ord, split // while <>;

or

print grep /[^[:ascii:]]/, split // while <>;
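
To illustrate why the bytes-versus-characters question matters, here is a minimal sketch. It assumes the input is UTF-8 (the original question does not say), and the sample string is made up for the demonstration. The same data gives a different count depending on whether it is inspected as raw bytes or as decoded characters:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9 (sample data, assumed UTF-8).
my $bytes = "caf\xC3\xA9";

# Inspected as raw bytes: two values above 127.
my @high_bytes = grep { $_ > 127 } unpack("C*", $bytes);

# Decoded to characters first: one code point above 127 (U+00E9).
my $chars = decode("UTF-8", $bytes);
my @high_chars = grep { ord($_) > 127 } split //, $chars;

printf "bytes above 127: %d\n", scalar @high_bytes;
printf "characters above 127: %d\n", scalar @high_chars;
```

For a file in a single-byte encoding such as Latin-1, the two views coincide and either one-liner above does the job.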

Upvotes: 2

Hynek -Pichi- Vychodil

Reputation: 26121

One-liner:

perl -nE'say$.if/[\xE0-\xFF]/'

For older Perl versions:

perl -lne'print$.if/[\xE0-\xFF]/'

Upvotes: 5

Stephan202

Reputation: 61549

Since the extended ASCII characters have value 128 and higher, you can just call ord on individual characters and handle those with a value >= 128. The following code reads from stdin and prints only the extended ASCII characters:

while (<>) {
  while (/(.)/g) {
    print($1) if (ord($1) >= 128);
  }
}

Alternatively, unpack together with chr will also work. Example:

while (<>) {
  foreach (unpack("C*", $_)) {
    print(chr($_)) if ($_ >= 128);
  }
}

(I'm sure some Perl guru can condense both of these to two one-liners...)


To print the line numbers instead, you can use the following (this does not remove duplicates, so a line number is printed once for every matching character on that line, and it will behave oddly when Unicode input is passed):

while (<>) {
  while (/(.)/g) {
    print($. . "\n") if (ord($1) >= 128);
  }
}

(Thanks Yaakov Belch for the $. tip.)
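
If each line number should appear only once, one possible variation is to match the whole line against a byte range instead of looping over characters. This is only a sketch: lines_with_high_bytes is a made-up helper name, the input is treated as raw bytes, and the demo reads from an in-memory string rather than a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the numbers of the lines that contain at
# least one byte in the range 0x80-0xFF, reporting each line once.
sub lines_with_high_bytes {
    my ($fh) = @_;
    binmode $fh;    # read raw bytes, no decoding layer
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $. if $line =~ /[\x80-\xFF]/;
    }
    return @hits;
}

# Demo on an in-memory file handle; only line 2 contains the byte 0xE9,
# and it contains it twice.
my $data = "plain\ncaf\xE9 \xE9\nplain again\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my @lines = lines_with_high_bytes($fh);
close $fh;

print "$_\n" for @lines;
```

To use it on a real file, open that file instead of the in-memory string and pass the handle in.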

Upvotes: 10

Dave Sherohman

Reputation: 46207

The first printable ASCII character is space (32). The last printable ASCII character is ~ (126). So I'd probably use

while (<>) {
  chomp;  # strip the newline, which would otherwise match [^ -~] on every line
  print "$.\n" if /[^ -~]/;
}

although it will, admittedly, display lines containing control characters (tabs included) as well as extended ASCII.

Edit: Changed to print the line number rather than the line itself.
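
If the control-character caveat matters, a rough way to tell the two cases apart is to test each range separately. This is a sketch only: classify_line is a made-up helper, the sample strings are invented, and the data is assumed to be single-byte (raw bytes).

```perl
use strict;
use warnings;

# Hypothetical helper: report which kinds of non-printable-ASCII bytes
# a line (already stripped of its newline) contains.
sub classify_line {
    my ($line) = @_;
    my @kinds;
    push @kinds, 'control'  if $line =~ /[\x00-\x1F\x7F]/;   # ASCII control characters
    push @kinds, 'extended' if $line =~ /[\x80-\xFF]/;       # bytes above 127
    return @kinds;
}

# Invented sample lines: plain, with a tab, with 0xE9, with a bell and 0xFC.
my @samples = ("plain text", "tab\there", "caf\xE9", "bell\x07 and \xFC");

for my $i (0 .. $#samples) {
    my @kinds = classify_line($samples[$i]);
    printf "%d: %s\n", $i + 1, @kinds ? join(',', @kinds) : 'printable ASCII only';
}
```

Whether a tab should count as a problem depends on the file; removing \x09 from the first character class is the obvious tweak.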

Upvotes: 8
