Haifa Warda

Reputation: 135

perl output - failing in printing utf8 text files correctly

I have UTF-8 text files that I want to read in, store line by line in an array, and print out. But the output doesn't render the characters correctly; for example, an output line looks like this:

"arnſtein gehört gräflichen "

So I tested the script with a single line pasted directly into the Perl script instead of reading it from a file, and there the output is perfectly fine. I checked the files, and they are UTF-8 encoded. Still, the files must be causing the output problem (?).

Because the script is too long, I cut it down to the relevant part (it goes to a directory, opens the files, passes the input to the function &align, analyses it, adds it to an array, and prints the array):

#!/usr/bin/perl -w
use strict;

use utf8;
binmode(STDIN,":utf8");
binmode(STDOUT,":utf8");
binmode(STDERR,":utf8");

#opens directory
#opens file from directory
 if (-d "$dir/$first"){
  opendir (UDIR, "$dir/$first") or die "could not open: $!";
  foreach my $t (readdir(UDIR)){
   next if $t eq ".";
   next if $t eq "..";

   open(GT,"$dir/$first/$t") or die "Could not open GT, $!";
   my $gt= <GT>;
   chomp $gt;

   #directly pasted lines in perl   - creates correct output
   &align("det man die Profeſſores der Philoſophie re- ");

    #lines from file    - output not correct
    #&align($gt);
    close GT;
    next;

  }closedir UDIR;
}

Any ideas?

Upvotes: 2

Views: 345

Answers (1)

cjm

Reputation: 62109

You told Perl that your source code was UTF-8, and that STDIN, STDOUT, & STDERR are UTF-8, but you didn't say that the file you're reading contains UTF-8.

open(GT,"<:utf8", "$dir/$first/$t") or die "Could not open GT, $!";

Without that, Perl assumes the file is encoded in ISO-8859-1, since that's Perl's default charset if you don't specify a different one, so every byte of the file is read as one separate character. It helpfully transcodes those ISO-8859-1 characters to UTF-8 for output, since you've told it that STDOUT uses UTF-8. Since the file was actually UTF-8, not ISO-8859-1, each multi-byte character gets encoded a second time and you get incorrect output (for example, "ſ" turns into "Å¿").
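To see the effect in isolation, here is a minimal sketch that writes one UTF-8 line to a temporary file and reads it back twice, once without a layer and once with one. The helper read_first_line() and the sample text are made up for illustration and are not part of the original script. It uses the stricter ":encoding(UTF-8)" layer instead of ":utf8"; as far as I know both decode the input, but ":encoding(UTF-8)" also checks that the bytes are valid UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                          # string literals in this file are UTF-8
use File::Temp qw(tempfile);

# Hypothetical helper: read the first line of a file,
# optionally through an I/O layer such as ":encoding(UTF-8)".
sub read_first_line {
    my ($path, $layer) = @_;
    my $mode = defined $layer ? "<$layer" : "<";
    open(my $fh, $mode, $path) or die "Could not open $path: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

# Write a sample line to a temporary file as UTF-8 bytes.
my $text = "Profeſſores gehört";
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ":encoding(UTF-8)");
print {$out} "$text\n";
close $out;

# No layer: every byte becomes its own character.
my $raw = read_first_line($path);
# With layer: the bytes are decoded into the original characters.
my $decoded = read_first_line($path, ":encoding(UTF-8)");

binmode(STDOUT, ":encoding(UTF-8)");
print "without layer: $raw\n";      # garbled, like the output in the question
print "with layer:    $decoded\n";  # the original text
printf "lengths: %d vs %d\n", length($raw), length($decoded);
```

The undecoded string is longer than the decoded one because "ſ" and "ö" each take two bytes in UTF-8. If every file the script opens is UTF-8, putting "use open qw(:std :encoding(UTF-8));" near the top should set the layer for STDIN, STDOUT, STDERR and for open() calls in that scope, so each open() no longer needs it spelled out.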

Upvotes: 3
