charlesbridge

Reputation: 1212

Perl File Test for Text -T and PDFs

I'm trying to limit my Perl/Tk code so that it only opens text files for editing. I'm testing to make sure the user selected a valid file (I'm using Tk's getOpenFile()):

if ( (defined $file) and (-f $file) and (-T $file) ) {
  #work with file
}

The problem I've run into is that some PDF files pass the -T test and get opened (causing much chaos). I tried this code in a directory full of PDFs:

#!/usr/bin/perl

use strict;
use warnings;

my @files = <*>;
foreach (@files) {
  if (-T) { print "$_ is a text file\n" }
}

About half of the PDFs in the directory get printed.

Am I using -T wrong? Will I have to add a regex to filter out PDFs? And how come Perl thinks only some of the PDFs are text?

EDIT: -T is a file test that should return true if the file is plain text. I'm not trying to check for taint.

Upvotes: 1

Views: 1942

Answers (5)

Borodin

Reputation: 126722

You are using -T correctly; it is only a best guess, not an absolute classification. It may help to know that PDF files begin with the four-byte signature %PDF (a fourcc), which you can check easily with a subroutine like this:

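sub isPDF {
  # Open in raw mode; an unreadable file is reported as "not a PDF".
  open my $fh, '<:raw', shift or return;
  # Compare the first four bytes with the PDF signature.
  read $fh, my $fourcc, 4;
  return $fourcc eq '%PDF';
}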
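
To tie this back to the condition in the question, here is a minimal sketch that combines the -f and -T tests with the signature check. The name is_editable_text is made up for illustration, and treating every file that starts with %PDF as non-editable is an assumption about what the editor should accept.

```perl
use strict;
use warnings;

# Hypothetical helper: true only for plain files that pass -T
# and do not start with the "%PDF" signature.
sub is_editable_text {
    my ($file) = @_;
    return 0 unless defined $file && -f $file && -T $file;
    open my $fh, '<:raw', $file or return 0;
    my $got = read $fh, my $head, 4;   # number of bytes actually read
    close $fh;
    return 0 if $got && $head eq '%PDF';
    return 1;
}
```

With that in place, the test in the question would become if (is_editable_text($file)) { ... }.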

Upvotes: 2

Chris Dolan

Reputation: 8963

Most PDFs have a few binary characters right after the %PDF on purpose to hint that it's not (entirely) a plain text file. The PDF spec even recommends it:

Note: If a PDF file contains binary data, as most do (see Section 3.1, “Lexical Conventions”), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters (that is, characters whose codes are 128 or greater). This will ensure proper behavior of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.

In @mugen kenichi's answer, you can see the %íì¦" that attempts to trigger this.
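
That hint is not always enough for -T, which may be why only some PDFs pass. As perlfunc describes it, -T looks at the first block or so of the file: any zero byte there means binary, and otherwise the file counts as binary only if more than a third of the characters are odd (strange control codes or high-bit bytes). A handful of high-bit bytes followed by ASCII dictionary syntax stays well under that threshold. The following is a rough approximation of that rule, not Perl's actual implementation: it skips the UTF-8 check, the 512-byte block size is an assumption, and looks_like_text is a made-up name.

```perl
use strict;
use warnings;

# Rough approximation of the documented -T rule, not Perl's real code.
# The default block size (512) is an assumed value for
# "the first block or so".
sub looks_like_text {
    my ($file, $block_size) = @_;
    $block_size ||= 512;
    open my $fh, '<:raw', $file or return;
    my $len = read $fh, my $buf, $block_size;
    close $fh;
    return unless defined $len;       # read error: no verdict
    return 1 unless $len;             # empty file counts as text
    return 0 if $buf =~ /\0/;         # any NUL byte means binary
    # Count bytes that are not printable ASCII or common control codes.
    my $odd = () = $buf =~ /[^\x20-\x7E\t\n\r\f\x08\x1B]/g;
    return $odd * 3 <= $len ? 1 : 0;  # more than a third odd: binary
}
```

A PDF whose first block holds compressed streams or images will usually fail this kind of test, while one that starts with uncompressed objects will usually pass.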

Upvotes: 0

matthias krull

Reputation: 4429

You may have more success with the File::Type or File::LibMagic modules.
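
A minimal sketch of the File::LibMagic route, assuming a version of the module that provides info_from_filename (it is not a core module, so the code falls back to undef when it is missing). The helper name mime_type_of is made up for illustration.

```perl
use strict;
use warnings;

# Returns the MIME type reported by libmagic, or undef if the
# File::LibMagic module is not installed or the lookup fails.
sub mime_type_of {
    my ($file) = @_;
    return undef unless eval { require File::LibMagic; 1 };
    my $info = eval { File::LibMagic->new->info_from_filename($file) };
    return $info ? $info->{mime_type} : undef;
}
```

A caller could then refuse anything whose type is application/pdf, or only accept types that start with text/.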

PDF is mostly plain text. Compression, images, and encryption make a PDF look binary, but a simple PDF is plain text as far as naive tests are concerned.

The minimal PDF from the spec, in a simplified version, is plain text:

%PDF-1.1
%íì¦"

1 0 obj
  << /Type /Catalog
     /Pages 2 0 R
  >>
endobj

2 0 obj
  << /Type /Pages
     /Kids [3 0 R]
     /Count 1
     /MediaBox [0 0 300 144]
  >>
endobj

3 0 obj
  <<  /Type /Page
      /Parent 2 0 R
      /Resources
       << /Font
           << /F1
               << /Type /Font
                  /Subtype /Type1
                  /BaseFont /Times-Roman
               >>
           >>
       >>
      /Contents [
        << /Length 105 >>
        stream
          BT
            /F1 18 Tf
            0 0 Td
            (Hello world.) Tj
          ET
        endstream ]
  >>
endobj

xref
0 4
0000000000 65535 f 
0000000019 00000 n 
0000000078 00000 n 
0000000179 00000 n 
trailer
  <<  /Root 1 0 R
      /Size 4
  >>
startxref
612
%%EOF

Upvotes: 2

tuxuday

Reputation: 3037

As Øyvind Skaar pointed out, try the 'file' command.

Upvotes: -1

Øyvind Skaar

Reputation: 2328

A couple of suggestions:

  • Have you tried a newer Perl? The docs call -T a "heuristic guess"; maybe it has been improved.
  • Kind of a hack, but you could try running 'file' on the files before opening them.
  • Another hack: read the first line after open() to see if it really is text.

I don't know why it fails, though. Do you have a publicly accessible PDF file that passes -T?
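
For the 'file' suggestion, something along these lines might work on a Unix-like system. It assumes a file(1) that understands the --brief and --mime-type options, and file_mime_type is a made-up name.

```perl
use strict;
use warnings;

# Ask the external file(1) utility for a MIME type.
# Returns undef if the command is missing or fails.
sub file_mime_type {
    my ($path) = @_;
    # List-form pipe open avoids passing the filename through a shell.
    my $pid = open my $out, '-|', 'file', '--brief', '--mime-type', '--', $path;
    return undef unless $pid;
    my $type = <$out>;
    close $out or return undef;   # non-zero exit status from file(1)
    return undef unless defined $type;
    chomp $type;
    return $type;
}
```

The editor could then skip any file for which this returns application/pdf.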

Upvotes: -1
