Reputation: 51
I am trying to list all the file names in a directory and determine which names contain special characters. I am using this regex:
/[^-a-zA-Z0-9_.]/
Sample files (created with touch):
pdf-2014à014&7_06_64-Os_O&L,_Inc.pdf
pdf-20_06_04-O_OnLine,_Inc.pdf
pdf-20_0_0-Utà_d.wr.pdf
pdf-20_12_28-20.Mga_Grf.Fwd_Notice_KDJFI789&_JFK38.pdf
pdf-2_0_0-C_—_DUKE.pdf
pdf-2_1_3-f_s-M_F_D&A.pdf
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf
Perl prints the correct name before I match it against the regex. The regex does match the special character, but the character changes when I print it:
* pdf-2_0_0-C_—_DUKE.pdf > pdf-2_0_0-C_???_DUKE.pdf
If I uncomment this line
#binmode(STDOUT, ":utf8");
and run the script again, the ? marks are gone, but the output is still wrong:
* pdf-2_0_0-C_—_DUKE.pdf > pdf-2_0_0-C_â_DUKE.pdf
Here is my code:
use strict;
use warnings;
use File::Find;
use Cwd;
#binmode(STDOUT, ":utf8");
my $starting_directory = cwd();
use Term::ANSIColor;
checkForSpecialChar(cwd());
sub checkForSpecialChar{
my ($source_dir) = @_;
chdir $source_dir or die qq(Cannot change into "$source_dir");
find ( sub {
return unless -f; #We want files only
print "\n";
while(m/([^-a-zA-Z0-9_.])/g){
chomp($_);
print "DETECTED: |" . $_ . "|\n";
print $`;
print color 'bold red';
print "$1";
print color 'reset';
print $' . "\n";
}
}, ".");
chdir($starting_directory);
}
Any ideas?
UPDATE: You are right, it looks like this is not a problem with the regex. Hi AKHolland, I changed the code to match yours for testing, but it still produces the same problem with the hyphen and the lowercase a-grave. Instead of a lowercase a-grave, it gives me a` when not using binmode(STDOUT, ":utf8"); and aÌ when using binmode(STDOUT, ":utf8");
use strict;
use warnings;
use File::Find;
use Cwd;
use Encode;
binmode(STDOUT, ":utf8");
my $starting_directory = cwd();
use Term::ANSIColor;
checkForSpecialChar(cwd());
sub checkForSpecialChar{
my ($source_dir) = @_;
chdir $source_dir
or die qq(Cannot change into "$source_dir");
find ( sub {
return unless -f; #We want files only
print $_ . "\n";
$_ = Encode::decode_utf8($_);
for(my $counter =0; $counter < length($_); $counter++) {
print Encode::encode_utf8(substr($_,$counter,1)) . "\n";
}
}, ".");
chdir("$starting_directory"); }
Output with binmode(STDOUT, ":utf8");
pdf-2_0_0-C_â_DUKE.pdf
p d f - 2 _ 0 _ 0 - C _ â _ D U K E . p d f
pdf_-_2014aÌ014&1007_0617_06264-O_O&L,_Inc.pdf
p d f _ - _ 2 0 1 4 a Ì 0 1 4 & 1 0 0 7 _ 0 6 1 7 _ 0 6 2 6 4 - O _ O & L , _ I n c . p d f
Output without binmode(STDOUT, ":utf8");
pdf-2_0_0-C_—_DUKE.pdf
p d f - 2 _ 0 _ 0 - C _ — _ D U K E . p d f
pdf_-_2014à014&1007_0617_06264-O_O&L,_Inc.pdf
p d f _ - _ 2 0 1 4 a ̀ 0 1 4 & 1 0 0 7 _ 0 6 1 7 _ 0 6 2 6 4 - O _ O & L , _ I n c . p d f
Upvotes: 5
Views: 1426
Reputation: 311
The character "—" is the Em Dash, from the Unicode character set, with code point U+2014.
Characters with code points from U+0800 to U+FFFF are encoded using three bytes in UTF-8.
For a 3-byte encoding, 16 bits of the binary version of the code point, 2014h, are considered. Convert the hex to binary, and you get 0010 0000 0001 0100b.
That's two bytes. Where do we get three bytes from then? That's because, the UTF-8 encoding rules entail:
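As a rough illustration of the three-byte layout (the standard UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx), the 16 bits of 2014h can be packed by hand. This is only a sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Code point of the em dash (U+2014).
my $cp = 0x2014;

# Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
my $b1 = 0xE0 | ( $cp >> 12 );              # top 4 bits of the code point
my $b2 = 0x80 | ( ( $cp >> 6 ) & 0x3F );    # middle 6 bits
my $b3 = 0x80 | ( $cp & 0x3F );             # low 6 bits

my $hex = sprintf '%02x %02x %02x', $b1, $b2, $b3;
print "$hex\n";    # e2 80 94
```

The 16 bits are split 4 + 6 + 6, and each group gets a fixed prefix, which is where the third byte comes from.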
Your line binmode(STDOUT, ":utf8");
merely encodes the output going to STDOUT as UTF-8. However, as mentioned in the comments on another answer, the Windows file system (NTFS) uses UTF-16 for file names, and UTF-16 uses two bytes, not three, for characters in that range. Those two bytes are numerically equal to the code point itself, 2014h.
Thus, you also need to decode the input. AKHolland's answer tells you how.
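A small sketch of what probably happens when the name is not decoded, assuming the file name reaches Perl as UTF-8 bytes (that is an assumption, and the variable names are only illustrative): each byte is treated as its own character and encoded a second time on output.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Raw bytes of the em dash, as a directory listing might return them
# (assumption: the name is stored as UTF-8).
my $raw = "\xe2\x80\x94";

# Without decoding, Perl sees three characters (0xE2, 0x80, 0x94).
# Encoding them again, as a ":utf8" output layer would, doubles them up.
my $double     = encode_utf8($raw);
my $double_hex = join ' ', map { sprintf '%02x', ord($_) } split //, $double;

# With decoding, Perl sees the single character U+2014, which encodes
# back to the original three bytes.
my $char           = decode_utf8($raw);
my $round_trip_hex = join ' ', map { sprintf '%02x', ord($_) } split //, encode_utf8($char);

print "double-encoded: $double_hex\n";
print "round trip:     $round_trip_hex\n";
```

The bytes c3 a2 at the start of the double-encoded form are the UTF-8 encoding of "â", which would match the "â" seen in the question's output.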
Upvotes: 0
Reputation: 91415
The character —
in pdf-2_0_0-C_—_DUKE.pdf
is encoded as 3 bytes in UTF-8:
char Unicode UTF-8
— U+2014 \xe2\x80\x94
So, as @AKHolland said, you have to decode it on input and encode it on output.
Upvotes: 1
Reputation: 4445
You need to decode it on the way in and encode it on the way out. Something like this:
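use strict;
use warnings;
use File::Find;
use Encode;
find ( sub {
    # File names arrive as raw bytes; decode them into characters first
    $_ = Encode::decode_utf8($_);
    while(m/([^-a-zA-Z0-9_.])/g){
        # Encode each matched character back to UTF-8 bytes for output
        my $chr = Encode::encode_utf8($1);
        print "$chr\n";
    }
}, ".");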
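To see what the decode step changes for the regex itself, here is a minimal sketch that counts matches on one of the sample names before and after decoding. It assumes the name is UTF-8 bytes, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# One of the sample names, written as the raw bytes Perl would receive
# (assumption: UTF-8 on disk), and the same name after decoding.
my $bytes = "pdf-2_0_0-C_\xe2\x80\x94_DUKE.pdf";
my $chars = decode_utf8($bytes);

# Count how many times the "special character" class matches in each form.
my $hits_bytes = () = $bytes =~ /[^-a-zA-Z0-9_.]/g;
my $hits_chars = () = $chars =~ /[^-a-zA-Z0-9_.]/g;

print "undecoded: $hits_bytes matches, decoded: $hits_chars match\n";
```

On the undecoded name, the class matches each of the three bytes separately; after decoding, it matches the em dash once as a single character.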
Upvotes: 3