Reputation: 1212
I'm trying to limit my Perl/Tk code so that it only opens text files for editing. I'm testing to make sure the user selected a valid file (I'm using Tk's getOpenFile()):
if ( (defined $file) and (-f $file) and (-T $file) ) {
    # work with file
}
The problem I've run into is that some PDF files pass the -T test and get opened (causing much chaos). I tried this code in a directory full of PDFs:
#!/usr/bin/perl
use strict;
use warnings;
my @files = <*>;
foreach (@files) {
    if (-T) { print "$_ is a text file\n" }
}
About half of the PDFs in the directory get printed.
Am I using -T wrong? Will I have to add a regex to filter out PDFs? And how come Perl thinks only some of the PDFs are text?
EDIT: -T is a file test that should return true if the file is plain text. I'm not trying to check for taint.
Upvotes: 1
Views: 1942
Reputation: 126722
You are using -T correctly: it is just a best guess rather than an absolute classification. It may help to know that PDF files begin with the four-byte signature (fourcc) %PDF, which you can check easily with a subroutine like this:
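sub isPDF {
    my ($path) = @_;
    # Open in raw mode so the signature bytes are read unmodified.
    open my $fh, '<:raw', $path or return;
    # Read the first four bytes; a short read simply fails the comparison.
    read $fh, my $fourcc, 4;
    return defined $fourcc && $fourcc eq '%PDF';
}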
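To tie this back to the condition in the question, here is a minimal sketch that combines the -f and -T tests with the signature check. The names is_pdf and is_editable_text are made up for illustration, and it only excludes PDFs, not every binary format that might slip past -T.

```perl
use strict;
use warnings;

# True if the file starts with the four-byte PDF signature.
sub is_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return;
    my $got = read $fh, my $magic, 4;
    close $fh;
    return defined $got && $got == 4 && $magic eq '%PDF';
}

# Hypothetical helper: a regular file that -T guesses is text
# and that does not carry the PDF signature.
sub is_editable_text {
    my ($file) = @_;
    return defined $file && -f $file && -T $file && !is_pdf($file);
}
```

In the Tk callback you would then write something like `if ( is_editable_text($file) ) { ... }` in place of the original three-part test.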
Upvotes: 2
Reputation: 8963
Most PDFs have a few binary characters right after the %PDF on purpose to hint that it's not (entirely) a plain text file. The PDF spec even recommends it:
Note: If a PDF file contains binary data, as most do (see Section 3.1, “Lexical Conventions”), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater. This will ensure proper behavior of file transfer applications that inspect data near the beginning of a file to determine whether to treat the file’s contents as text or as binary.
In @mugen kenichi's answer, you can see the %íì¦" line that attempts to trigger this.
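If you want to see which of your PDFs follow that recommendation, a rough sketch like the one below can check the second line for high-bit bytes. The name has_binary_marker is hypothetical, and the code assumes the header and comment lines end in "\n", which is not guaranteed for every PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line after the %PDF header is a
# comment containing at least four bytes with codes of 128 or greater.
sub has_binary_marker {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;
    my $header  = <$fh>;    # e.g. "%PDF-1.1\n"
    my $comment = <$fh>;    # the recommended binary comment line, if any
    close $fh;
    return 0 unless defined $header  && $header  =~ /\A%PDF-/;
    return 0 unless defined $comment && $comment =~ /\A%/;
    # Count the bytes in the comment line that have the high bit set.
    my $high = () = $comment =~ /[\x80-\xFF]/g;
    return $high >= 4 ? 1 : 0;
}
```

As far as I understand perlfunc, -T examines only the first block of the file and calls it binary if that block contains a NUL byte or if more than roughly a third of its bytes look odd. Four marker bytes are a tiny fraction of a block, so this comment line alone probably won't make -T fail, which would fit with only some PDFs passing the test.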
Upvotes: 0
Reputation: 4429
You may have more success with the File::Type or File::LibMagic modules.
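For example, a minimal sketch with File::LibMagic might look like this. It assumes the module and libmagic are installed (they are not core Perl), and the helper names mime_type_of and looks_like_plain_text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the MIME type reported by libmagic, or
# undef if File::LibMagic is not installed or the lookup fails.
sub mime_type_of {
    my ($path) = @_;
    return undef unless eval { require File::LibMagic; 1 };
    my $info = eval { File::LibMagic->new->info_from_filename($path) };
    return $info ? $info->{mime_type} : undef;
}

# Accept only files whose detected MIME type is in the text/ family.
sub looks_like_plain_text {
    my ($path) = @_;
    my $mime = mime_type_of($path);
    return defined $mime && $mime =~ m{\Atext/};
}
```

A PDF should come back as application/pdf and be rejected. Note that this sketch also rejects everything when the module is missing, so you may want to fall back to -T in that case.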
A PDF is mostly plain text. Compression, images, and encryption make many of them appear binary, but simple PDFs look like plain text to naive tests.
The minimal PDF from the spec, in a simplified version, is plain text:
%PDF-1.1
%íì¦"
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<< /Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 300 144]
>>
endobj
3 0 obj
<< /Type /Page
/Parent 2 0 R
/Resources
<< /Font
<< /F1
<< /Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
>>
>>
>>
/Contents [
<< /Length 105 >>
stream
BT
/F1 18 Tf
0 0 Td
(Hello world.) Tj
ET
endstream ]
>>
endobj
xref
0 4
0000000000 65535 f
0000000019 00000 n
0000000078 00000 n
0000000179 00000 n
trailer
<< /Root 1 0 R
/Size 4
>>
startxref
612
%%EOF
Upvotes: 2
Reputation: 2328
A couple of suggestions:
I don't know why it fails, though. Do you have a publicly accessible PDF file that passes -T?
Upvotes: -1