Gisli

Reputation: 734

Get encoding of string read from file

I'm trying to convert a Perl script into a PowerShell script. I'm having problems with the part where the script reads a log file and has to determine the file's encoding.

Here is the Perl code:

sub get_encoding {
my $f = shift;
my $fh;
return "ASCII" if (!open ($fh,"<",$f));
my $b = "";
my $n = read ($fh,$b,2);
close ($fh);
return "UTF-16" if ($b eq "\x{ff}\x{fe}");
return "ASCII";
}

It is called like so:

get_encoding ($l->{file})

Where $l->{file} is a path to the log file.

Can anyone explain what is going on, especially in this line:

return "UTF-16" if ($b eq "\x{ff}\x{fe}");

And if anyone knows a good way to do this in PowerShell, any tips are much appreciated.

Gísli

Upvotes: 0

Views: 1847

Answers (3)

gugod

Reputation: 830

The program reads and examines the first 2 bytes of the given file to decide whether it should return the string "ASCII" or "UTF-16".

Here is a more detailed description:

If the file cannot be opened, for whatever reason, it returns "ASCII". (Weird, but that's what it does.)

return "ASCII" if (!open ($fh,"<",$f));

If the file is opened successfully as file handle $fh, read($fh, $b, 2) reads the first 2 (8-bit) bytes into the variable $b. The return value of read, which is the number of bytes actually read, is stored in the variable $n, although it is never used later.

my $b = "";
my $n = read ($fh,$b,2);

The file handle $fh is closed right after the read.

close ($fh);

If the value of $b is exactly "\x{ff}\x{fe}", "UTF-16" is returned, although it would be more precise to return "UTF-16LE", because FF FE is the little-endian byte order mark. \x{..} denotes a byte by its hex value, so "\x{ff}\x{fe}" contains two bytes, not 10 or 12.

return "UTF-16" if ($b eq "\x{ff}\x{fe}");

Finally, if $b is not equal to "\x{ff}\x{fe}", "ASCII" is returned.

return "ASCII";
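The steps above could be ported to PowerShell roughly as follows. This is only a minimal sketch: the function name Get-LogEncoding is made up for illustration, and it assumes the .NET method [System.IO.File]::OpenRead is acceptable for reading the raw bytes. It keeps the Perl behaviour of returning "ASCII" when the file cannot be opened.

```powershell
# Hypothetical helper name; mirrors the Perl get_encoding step by step.
function Get-LogEncoding {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path
    )

    try {
        # Open read-only, like open($fh, "<", $f) in the Perl version.
        $stream = [System.IO.File]::OpenRead($Path)
    }
    catch {
        # The Perl code falls back to "ASCII" when the file cannot be opened.
        return 'ASCII'
    }

    try {
        # Read at most the first two bytes, like read($fh, $b, 2).
        $buffer = New-Object byte[] 2
        $count = $stream.Read($buffer, 0, 2)
    }
    finally {
        # Always release the handle, like close($fh).
        $stream.Close()
    }

    # FF FE is the UTF-16 little-endian byte order mark.
    if ($count -eq 2 -and $buffer[0] -eq 0xFF -and $buffer[1] -eq 0xFE) {
        return 'UTF-16'
    }
    return 'ASCII'
}
```

It would be called with something like Get-LogEncoding -Path 'C:\logs\app.log' (the path is only a placeholder). A full path is safer here, because .NET resolves relative paths against the process working directory, which can differ from the current PowerShell location.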

Upvotes: 3

bpgergo

Reputation: 16037

The script has already read two bytes from $f into $b: my $n = read ($fh,$b,2);

The line in question tests whether these two bytes are exactly FF and FE.

FF FE is the byte order mark (BOM) for the UTF-16 little-endian encoding; see http://unicode.org/faq/utf_bom.html
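One way to check which bytes belong to which byte order mark from PowerShell itself is to ask .NET for the preamble of each encoding. This is a small sketch for illustration; the labels in the table are my own, and GetPreamble is the standard System.Text.Encoding method.

```powershell
# Print the byte order mark (preamble) that .NET associates with each encoding.
$encodings = [ordered]@{
    'UTF-16 LE' = [System.Text.Encoding]::Unicode
    'UTF-16 BE' = [System.Text.Encoding]::BigEndianUnicode
    'UTF-8'     = [System.Text.Encoding]::UTF8
}

foreach ($name in $encodings.Keys) {
    # Format each preamble byte as two uppercase hex digits.
    $bom = ($encodings[$name].GetPreamble() | ForEach-Object { $_.ToString('X2') }) -join ' '
    '{0}: {1}' -f $name, $bom
}
```

This prints FF FE for UTF-16 LE, FE FF for UTF-16 BE and EF BB BF for UTF-8, which matches the two bytes the Perl code compares against.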

Upvotes: 1

CB.

Reputation: 60918

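Adapted from http://franckrichard.blogspot.com/2010/08/powershell-get-encoding-file-type.html

    function Get-FileEncoding {
        [CmdletBinding()]
        Param (
            [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
            [string]$Path
        )
        # Read up to the first 4 bytes. Windows PowerShell syntax; PowerShell 6+ uses -AsByteStream instead of -Encoding Byte.
        [byte[]]$byte = Get-Content -Encoding Byte -ReadCount 4 -TotalCount 4 -Path $Path
        # Compare against the known byte order marks; FF FE 00 00 is tested before FF FE because it starts with it.
        if ($byte[0] -eq 0xef -and $byte[1] -eq 0xbb -and $byte[2] -eq 0xbf)
        { Write-Output 'UTF8' }
        elseif ($byte[0] -eq 0xff -and $byte[1] -eq 0xfe -and $byte[2] -eq 0 -and $byte[3] -eq 0)
        { Write-Output 'UTF32' }             # UTF-32 little-endian
        elseif ($byte[0] -eq 0xff -and $byte[1] -eq 0xfe)
        { Write-Output 'Unicode' }           # UTF-16 little-endian, the BOM the Perl code tests for
        elseif ($byte[0] -eq 0xfe -and $byte[1] -eq 0xff)
        { Write-Output 'BigEndianUnicode' }  # UTF-16 big-endian
        elseif ($byte[0] -eq 0 -and $byte[1] -eq 0 -and $byte[2] -eq 0xfe -and $byte[3] -eq 0xff)
        { Write-Output 'BigEndianUTF32' }    # UTF-32 big-endian
        elseif ($byte[0] -eq 0x2b -and $byte[1] -eq 0x2f -and $byte[2] -eq 0x76)
        { Write-Output 'UTF7' }
        else
        { Write-Output 'ASCII' }
    }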

Upvotes: 1
