takuji

Reputation: 51

Outputting special (Unicode) characters correctly in Perl

I am trying to get all the file names in the directory and determine which names contain special characters. I am using the regex

/[^-a-zA-Z0-9_.]/

Sample files (created with touch):

pdf-2014à014&7_06_64-Os_O&L,_Inc.pdf
pdf-20_06_04-O_OnLine,_Inc.pdf
pdf-20_0_0-Utà_d.wr.pdf
pdf-20_12_28-20.Mga_Grf.Fwd_Notice_KDJFI789&_JFK38.pdf
pdf-2_0_0-C_—_DUKE.pdf
pdf-2_1_3-f_s-M_F_D&A.pdf
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf

Perl prints the name correctly before I match it against the regex. The regex does match the special character, but the character changes when I print it:

* pdf-2_0_0-C_—_DUKE.pdf          >        pdf-2_0_0-C_???_DUKE.pdf

If I uncomment this line

   #binmode(STDOUT, ":utf8");

and run the script again, the ? marks are gone, but the output is still wrong:

* pdf-2_0_0-C_—_DUKE.pdf          >        pdf-2_0_0-C_â_DUKE.pdf

Here is my code:

use strict;
use warnings;
use File::Find;
use Cwd;

#binmode(STDOUT, ":utf8");

my $starting_directory = cwd();

use Term::ANSIColor;

checkForSpecialChar(cwd());


sub checkForSpecialChar{
    my ($source_dir) = @_;

    chdir $source_dir or die qq(Cannot change into "$source_dir");

    find ( sub {
        return unless -f;   #We want files only
        print "\n";
        while(m/([^-a-zA-Z0-9_.])/g){ 
            chomp($_);
            print "DETECTED: |" . $_ . "|\n";
            print $`;
            print color 'bold red';
            print "$1";
            print color 'reset';
            print  $' . "\n";

        }

    }, ".");

    chdir($starting_directory);
}

Any ideas?

UPDATE: You are right, it looks like the regex is not the problem. AKHolland, I changed the code to look like yours for testing, but it still has the same problem with the dash and with the small letter a-grave. Instead of à I get a` without binmode(STDOUT, ":utf8"); and aÌ with it.

use strict;
use warnings;
use File::Find;
use Cwd;
use Encode;
binmode(STDOUT, ":utf8");

my $starting_directory = cwd();

use Term::ANSIColor;

checkForSpecialChar(cwd());


sub checkForSpecialChar{
   my ($source_dir) = @_;

   chdir $source_dir
       or die qq(Cannot change into "$source_dir");

   find ( sub {
      return unless -f;   #We want files only
      print $_ . "\n";
      $_ = Encode::decode_utf8($_);
      for(my $counter =0; $counter < length($_); $counter++) {
         print Encode::encode_utf8(substr($_,$counter,1)) .  "\n";
      }
   }, ".");

   chdir("$starting_directory");
}

Output with

    binmode(STDOUT, ":utf8");

pdf-2_0_0-C_â_DUKE.pdf
p
d
f
-
2
_
0
_
0
-
C
_
â
_
D
U
K
E
.
p
d
f
pdf_-_2014aÌ014&1007_0617_06264-O_O&L,_Inc.pdf
p
d
f
_
-
_
2
0
1
4
a
Ì
0
1
4
&
1
0
0
7
_
0
6
1
7
_
0
6
2
6
4
-
O
_
O
&
L
,
_
I
n
c
.
p
d
f

Output without

    binmode(STDOUT, ":utf8");

pdf-2_0_0-C_—_DUKE.pdf
p
d
f
-
2
_
0
_
0
-
C
_
—
_
D
U
K
E
.
p
d
f
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf
p
d
f
_
-
_
2
0
1
4
a
̀
0
1
4
&
1
0
0
7
_
0
6
1
7
_
0
6
2
6
4
-
O
_
O
&
L
,
_
I
n
c
.
p
d
f

Upvotes: 5

Views: 1426

Answers (3)

Anupama G

Reputation: 311

The character "—" is the Em Dash, from the Unicode character set, with code point U+2014.

Characters with code points from U+0800 to U+FFFF are encoded using three bytes in UTF-8.

A 3-byte encoding carries the 16 bits of the code point. Converting 2014h to binary gives 0010 0000 0001 0100b.

That is only two bytes of payload. The third byte comes from the marker bits that the UTF-8 encoding rules add:

  • In a 3-byte encoding, the 'leading' (highest value) byte must start with three '1' bits followed by one '0' bit. Thus, our leading byte takes the form 1110 followed by the first 4 bits we have; i.e. 11100010b or E2h.
  • All 'continuation' bytes (those after the leading byte) start with '10'. Thus, our second byte becomes 10 followed by the next 6 bits we have; i.e. 10000000b or 80h.
  • Likewise, the third byte (which is the second continuation byte) becomes 10 followed by our remaining 6 bits; i.e. 10010100b or 94h.
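The three rules above can be checked with a few bit operations. This is only a minimal sketch: the helper name utf8_three_bytes is made up for illustration, and it ignores the surrogate range (U+D800 to U+DFFF), which is not valid in UTF-8. The result is compared with what Encode::encode_utf8 produces for the same code point.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of any module): split one code point in
# U+0800..U+FFFF into its three UTF-8 bytes by hand.
sub utf8_three_bytes {
    my ($cp) = @_;
    die "code point outside the 3-byte range\n" if $cp < 0x0800 || $cp > 0xFFFF;
    my $lead  = 0xE0 | ($cp >> 12);           # 1110 + top 4 bits
    my $cont1 = 0x80 | (($cp >> 6) & 0x3F);   # 10   + middle 6 bits
    my $cont2 = 0x80 | ($cp & 0x3F);          # 10   + low 6 bits
    return ($lead, $cont1, $cont2);
}

# Format the hand-computed bytes as hex
my $manual = join ' ', map { sprintf '%02X', $_ } utf8_three_bytes(0x2014);

# Let Encode do the same job and format its bytes the same way
my $via_encode = join ' ',
    map { sprintf '%02X', ord($_) } split //, Encode::encode_utf8("\x{2014}");

print "by hand:    $manual\n";       # E2 80 94
print "via Encode: $via_encode\n";   # E2 80 94
```

Both lines print E2 80 94, matching the bytes derived in the list.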

The line binmode(STDOUT, ":utf8"); only encodes what is written to STDOUT as UTF-8. However, as mentioned in the comments on another answer, the Windows file system (NTFS) stores file names in UTF-16, and UTF-16 encodes characters in this range with two bytes, not three. Those two bytes are numerically equal to the code point itself, 2014h.

Thus, you also need to decode the input. AKHolland's answer tells you how.
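To see why encoding the output alone is not enough, here is a small sketch. It assumes the file name reaches Perl as UTF-8 bytes (as the other answers assume); that may not hold on every platform. The helper hex_bytes is made up for illustration. Encoding the undecoded bytes a second time is what a :utf8 output layer effectively does, and the C3 A2 it produces is the UTF-8 form of â, consistent with the output in the question.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: show a byte string as space-separated hex
sub hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

# Assumption: the directory entry arrives as the UTF-8 bytes of U+2014
my $raw = "\xE2\x80\x94";

# Without decoding, each byte is treated as a character 0..255 and
# encoded again, which is what printing through a :utf8 layer does
my $double = Encode::encode_utf8($raw);

# Decoding first yields one character; encoding that restores the bytes
my $char  = Encode::decode_utf8($raw);
my $round = Encode::encode_utf8($char);

printf "raw bytes:      %s\n", hex_bytes($raw);      # E2 80 94
printf "double-encoded: %s\n", hex_bytes($double);   # C3 A2 C2 80 C2 94
printf "decoded length: %d\n", length $char;         # 1
printf "round trip:     %s\n", hex_bytes($round);    # E2 80 94
```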

Upvotes: 0

Toto

Reputation: 91415

The character in pdf-2_0_0-C_—_DUKE.pdf is encoded with 3 bytes in UTF-8:

char Unicode   UTF-8
—    U+2014    \xe2\x80\x94

So, as @AKHolland said, you have to decode it on input and encode it on output.

Upvotes: 1

AKHolland

Reputation: 4445

You need to decode it on the way in and encode it on the way out. Something like this:

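use File::Find;
use Encode;

find ( sub {
    # File::Find passes the name as raw bytes; decode it into characters first
    $_ = Encode::decode_utf8($_);
    while(m/([^-a-zA-Z0-9_.])/g){
        # Encode each matched character back to UTF-8 bytes for output
        my $chr = Encode::encode_utf8($1);
        print "$chr\n";
    }
}, ".");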
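A possible variation, offered only as a sketch: let an output layer do the encoding, so each print does not need its own encode_utf8 call. The important part is to use either the layer or the explicit call, not both, otherwise the text is encoded twice. The helper name special_chars is made up for illustration, and the input is assumed to be UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a raw file name and return every character
# that falls outside the allowed set used in the question.
sub special_chars {
    my ($bytes) = @_;
    my $name = Encode::decode_utf8($bytes);     # bytes -> characters
    return $name =~ /([^-a-zA-Z0-9_.])/g;       # all matches in list context
}

# Encode on the way out via the layer; no encode_utf8 in the print below
binmode(STDOUT, ':encoding(UTF-8)');

# Assumption: the name arrives as UTF-8 bytes, here with U+2014 inside
my @found = special_chars("pdf-2_0_0-C_\xE2\x80\x94_DUKE.pdf");

# Print the code point and the character itself for each match
printf "U+%04X %s\n", ord($_), $_ for @found;
```

For the sample name this finds a single character, U+2014, and prints it as one dash rather than three separate bytes.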

Upvotes: 3
