Carey Stephenson

Reputation: 41

Trying to read a PDF, parse the data, and write the desired data to a spreadsheet using Perl on Linux

I am trying to extract data from credit card statements and enter it into a spreadsheet for tax purposes. My approach so far takes several steps; I'm relatively new to Perl and am working from what I know. Below are the two scripts I've written: the first reads all the text from a PDF and writes it to a text file, and the second parses that text (imperfectly) and writes the result to another text file. From there I'd like to either create a CSV file to import into a spreadsheet or write directly to a spreadsheet. Ideally this would all happen in one script, but two or three will suffice.

First script:

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

my $file = "/home/cd/Documents/Jan14.pdf";
my $pdf  = CAM::PDF->new($file) or die "Could not open '$file': $CAM::PDF::errstr\n";

# Concatenate the text of every page into one string
my $doc = "";
for my $i (1 .. $pdf->numPages()) {
    $doc .= $pdf->getPageText($i);
}

my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename': $!";
print $fh " $doc\n";
close $fh or die "Could not close file '$filename': $!";
print "done\n";

Second script:

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole report once; 'local $/' limits slurp mode to this block
my $file = do {
    local $/;
    open(my $in, '<', 'report.txt') or die "Could not open report.txt: $!";
    <$in>;
};

my $filename = 'data.txt';
open(my $fh, '>>', $filename) or die "Could not open file '$filename': $!";

# Extract each section of interest and append it to data.txt
for my $pattern (
    qr/(Date of Transaction.*?CONTINUED)/s,
    qr/(Page 2 .*?TRANSACTIONS THIS CYCLE)/s,
) {
    my ($stuff_that_interests_me) = $file =~ $pattern;
    if (!defined $stuff_that_interests_me) {
        warn "No match for $pattern\n";
        next;
    }
    print "$stuff_that_interests_me\n";
    print $fh " $stuff_that_interests_me\n";
    print "done\n";
}

close $fh or die "Could not close file '$filename': $!";

Update: I found a module (CAM::PDF) on CPAN that works great for what I'm trying to do. It even renders the data in a format that I can more easily use for my spreadsheet. However, I haven't yet figured out how to get it to write its output to a .txt file. Any suggestions?

#!/usr/bin/perl -w

package main;

use warnings;
use strict;
use CAM::PDF;
use Getopt::Long;
use Pod::Usage;
use English qw(-no_match_vars);

our $VERSION = '1.60';

my %opts = (
            density    => undef,
            xdensity    => undef,
            ydensity    => undef,
            check      => 0,
            renderer   => 'CAM::PDF::Renderer::Dump',
            verbose    => 0,
            help       => 0,
            version    => 0,
            );

Getopt::Long::Configure('bundling');
GetOptions('r|renderer=s' => \$opts{renderer},
           'd|density=f'  => \$opts{density},
           'x|xdensity=f' => \$opts{xdensity},
           'y|ydensity=f' => \$opts{ydensity},
           'c|check'      => \$opts{check},
           'v|verbose'    => \$opts{verbose},
           'h|help'       => \$opts{help},
           'V|version'    => \$opts{version},
           ) or pod2usage(1);
if ($opts{help})
{
   pod2usage(-exitstatus => 0, -verbose => 2);
}
if ($opts{version})
{
   print "CAM::PDF v$CAM::PDF::VERSION\n";
   exit 0;
}

if (defined $opts{density})
{
   $opts{xdensity} = $opts{ydensity} = $opts{density};
}
if (defined $opts{xdensity} || defined $opts{ydensity})
{
   if (!eval "require $opts{renderer}")  ## no critic (StringyEval)
   {
      die $EVAL_ERROR;
   }
   if (defined $opts{xdensity})
   {
      no strict 'refs'; ## no critic(ProhibitNoStrict)
      my $varname = $opts{renderer}.'::xdensity';
      ${$varname} = $opts{xdensity};
   }
   if (defined $opts{ydensity})
   {
      no strict 'refs'; ## no critic(ProhibitNoStrict)
      my $varname = $opts{renderer}.'::ydensity';
      ${$varname} = $opts{ydensity};
   }
}

if (@ARGV < 1)
{
   pod2usage(1);
}

my $file = shift;
my $pagelist = shift;

my $doc = CAM::PDF->new($file) || die "$CAM::PDF::errstr\n";

foreach my $p ($doc->rangeToArray(1, $doc->numPages(), $pagelist))
{
   my $tree = $doc->getPageContentTree($p, $opts{verbose});
   if ($opts{check})
   {
      print "Checking page $p\n";
      if (!$tree->validate())
      {
         print "  Failed\n";
      }
   }
   $tree->render($opts{renderer});
}
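
One possible way to get that output into a .txt file, as a sketch only: the script above prints to STDOUT, so the simplest route is shell redirection when running it (for example, appending > report.txt to the command line). The same effect from inside Perl is to point STDOUT at a file around the rendering loop. The helper name capture_stdout_to_file and the file name rendered.txt are hypothetical, and the print statement stands in for the foreach loop that calls $tree->render($opts{renderer}).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run $code with STDOUT temporarily redirected to $path.
sub capture_stdout_to_file {
    my ($path, $code) = @_;

    # Duplicate the current STDOUT so it can be restored afterwards
    open(my $saved, '>&', \*STDOUT) or die "Could not dup STDOUT: $!";
    open(STDOUT, '>', $path)        or die "Could not open '$path': $!";

    # Run the caller's code, remembering any error so STDOUT is restored first
    my $ok  = eval { $code->(); 1 };
    my $err = $@;

    close(STDOUT)              or die "Could not close '$path': $!";
    open(STDOUT, '>&', $saved) or die "Could not restore STDOUT: $!";
    die $err if !$ok;
    return;
}

# Stand-in for the rendering loop of the script above (an assumption):
# anything printed inside the sub goes to the file instead of the terminal.
capture_stdout_to_file('rendered.txt', sub {
    print "anything printed here lands in rendered.txt\n";
});
print "done\n";
```

Because the redirection wraps whatever code is passed in, the renderer itself does not need to know about the file.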

Upvotes: 4

Views: 813

Answers (1)

Chankey Pathak

Reputation: 21666

I'd like to either create a csv file to import into a spreadsheet or write directly to a spreadsheet.

You can write directly to a spreadsheet; check out Excel::Writer::XLSX.
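
A minimal sketch of that approach, assuming the transactions have already been parsed into rows of date, description, and amount. The sample rows, the file name statement.xlsx, and the worksheet name are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;

# Hypothetical sample rows: date, description, amount
my @rows = (
    ['01/05', 'OFFICE SUPPLY STORE', 42.17],
    ['01/09', 'AIRLINE TICKET',      310.00],
);

my $workbook = Excel::Writer::XLSX->new('statement.xlsx')
    or die "Could not create statement.xlsx: $!";
my $worksheet = $workbook->add_worksheet('Jan14');

# Header in spreadsheet row 0, then one row per transaction
$worksheet->write_row(0, 0, ['Date', 'Description', 'Amount']);
my $row = 1;
for my $r (@rows) {
    $worksheet->write_row($row++, 0, $r);
}

# The file is only complete once the workbook is closed
$workbook->close();
```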

If you want to create a CSV file instead, you can use Text::CSV or Text::CSV_XS.
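
A rough sketch of the CSV route with Text::CSV. It assumes each transaction line looks like "MM/DD description amount", which is a guess; real statement text will likely need a different pattern. The sample lines and the file name statement.csv are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# Assumed layout: "MM/DD description amount"; real statements will differ.
my @lines = (
    '01/05 OFFICE SUPPLY STORE 42.17',
    '01/09 AIRLINE TICKET, ONE WAY 310.00',
);

my $csv = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

open(my $out, '>', 'statement.csv') or die "Could not open statement.csv: $!";
$csv->print($out, ['Date', 'Description', 'Amount']);

for my $line (@lines) {
    # Date at the start, amount at the end, everything between is the description
    if (my ($date, $desc, $amount) =
            $line =~ m{^(\d\d/\d\d)\s+(.+?)\s+(-?[\d,]+\.\d\d)$}) {
        $csv->print($out, [$date, $desc, $amount]);
    }
    else {
        warn "Skipping unrecognised line: $line\n";
    }
}

close($out) or die "Could not close statement.csv: $!";
```

Letting the module write each row means a description that contains a comma is quoted for you, which is the main reason to prefer it over joining fields by hand.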

Upvotes: 3
