Mandar Pande

Reputation: 12974

Why am I unable to parse non-proportional text using CAM::PDF?

While parsing page 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf, I can extract every word except mount_vxfs, whose encoding and/or font differs from the normal body text. Please find the attached PDF page for details.

Here is my code:

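`#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

my $file_name = "vxfs_admin_51sp1_lin.pdf";
my $pdf = CAM::PDF->new($file_name) or die "Cannot open $file_name: $CAM::PDF::errstr\n";
my $no_pages = $pdf->numPages();
print "$no_pages\n";
# Pages are numbered from 1, so the loop must include the last page
for (my $i = 1; $i <= $no_pages; $i++) {
    my $page = $pdf->getPageText($i);
    # for page no. 22
    # if ($i == 22) {
    print $page;
    # }
}`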

Upvotes: 0

Views: 369

Answers (1)

djgp

Reputation: 26

PDF doesn't store the semantic text that you read; it stores character codes that map to glyphs (the painted characters) in a particular font. Often, however, the code-to-glyph mapping matches a common character encoding (such as ASCII or ISO-8859-1), so the codes are human-readable. That's the case for all of the text you have been able to parse, although the odd character, mostly punctuation, can still come out "wrong".

Unfortunately, the text for "mount_vxfs" in your document is encoded completely differently, which results in apparent garbage. If you're curious, you can see what's really there by replacing getPageText() with getPageContent() in your code.

In order to convert the PDF text back to meaningful characters, PDF readers have to jump through hoops with a number of conversion tables (including the so-called CMaps). Because this is a lot of programming work, many simpler libraries opt not to implement them. That's the case with CAM::PDF.
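To make the idea of a conversion table concrete, here is a toy sketch. The character codes and the %to_unicode table are invented for illustration (they are not taken from your document), and decode_codes is a hypothetical helper, not part of CAM::PDF. A real PDF reader would build such a table from the font's own data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical code-to-character table, standing in for a font's
# conversion table. The codes are made up; real tables are font-specific.
my %to_unicode = (
    0x01 => 'm', 0x02 => 'o', 0x03 => 'u', 0x04 => 'n', 0x05 => 't',
    0x06 => '_', 0x07 => 'v', 0x08 => 'x', 0x09 => 'f', 0x0A => 's',
);

# Translate a string of one-byte character codes into readable text.
# Codes missing from the table become U+FFFD, the replacement character.
sub decode_codes {
    my ($raw, $map) = @_;
    return join '', map { $map->{$_} // "\x{FFFD}" } unpack 'C*', $raw;
}

# What a content stream might hold: control bytes, not letters
my $raw = "\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A";

# Only after applying the table does the text become readable
print decode_codes($raw, \%to_unicode), "\n";
```

Without the table, a library can only hand back the raw bytes, which is roughly what you are seeing for that one word.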

If you're just interested in parsing the text (not editing it), the following technique is something I use with success:

  1. Obtain xpdf (http://foolabs.com/xpdf) or Poppler (http://poppler.freedesktop.org/). Poppler is a newer fork of xpdf. If you're using *nix, there will be a package available.

  2. Use the command-line tool 'pdftotext' to extract the text from a file, either page-wise or all at once.

Example:

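#!/usr/bin/perl
use strict;
use warnings;
use English qw(-no_match_vars);

my $file_name = "vxfs_admin.pdf";

# List-form pipe open: no shell is involved, so the file name needs no quoting
open my $text_fh, '-|', '/usr/bin/pdftotext', '-layout', '-q', $file_name, '-'
    or die "Cannot run pdftotext: $OS_ERROR";
local $INPUT_RECORD_SEPARATOR = "\f";    # pdftotext ends each page with a form feed
while (my $page_text = <$text_fh>) {
    # this is here only for demo purposes
    print $page_text if $INPUT_LINE_NUMBER == 19;
}
close $text_fh;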

(Note: The document I retrieved using your link is slightly different; the offending bit is on page 19 instead.)
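If you only need a single page, pdftotext can also do the page selection itself through its -f (first page) and -l (last page) options, which avoids converting the whole document. The following is a minimal sketch, assuming pdftotext is on your PATH; the sub names pdftotext_args and extract_page are my own and purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for extracting a single page with pdftotext.
# -f and -l set the first and last page; "-" sends the text to stdout.
sub pdftotext_args {
    my ($file, $page) = @_;
    return ('pdftotext', '-layout', '-q', '-f', $page, '-l', $page, $file, '-');
}

# Run pdftotext and return the text of one page.
# Assumes pdftotext is installed and reachable via PATH.
sub extract_page {
    my ($file, $page) = @_;
    my @cmd = pdftotext_args($file, $page);
    open my $fh, '-|', @cmd
        or die "Cannot run pdftotext: $!";
    local $/;                       # slurp the whole output at once
    my $text = <$fh>;
    close $fh or warn "pdftotext exited with status $?\n";
    return $text // '';
}

# Only runs when called as: perl extract_page.pl FILE PAGE
if (@ARGV == 2) {
    print extract_page(@ARGV);
}
```

For example, running it with the arguments vxfs_admin.pdf 19 should print just the page in question, without reading the other pages through the pipe.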

Upvotes: 1
