John O

Reputation: 5423

Saving PDF files with WWW::Mechanize corrupts them

I'm trying to write a script that will log into Bank of America and download PDF statements. I've managed all the difficult tricks, but I'm hung up on saving the PDF files. I've tried both the ':content_file' => "some file path" option and $mech->save_content("same file path"). Usually, either of these works fine (even for PDFs). A typical BoA PDF statement is 4 pages long and about 400k in size.

If I use the former method, it truncates the file to 33k, and it's unopenable by Preview on the Mac (but I can see the PDF header and EPS binary gibberish in Sublime). If I use the latter method, it saves the file with 95 extra bytes (compared to downloading it in Chrome) which somehow screws up the second page (of 4). The only visually obvious difference is that the Mechanize-downloaded file has an extra line containing the character '0' and a few newlines at the end. diff reports "Binary files 2014-06-19 Statement.pdf and eStmt_2014-06-19.pdf differ". I have no idea how to determine the remaining 92 bytes of difference.

Oooh, found something: using save_content(), every few hundred lines in the PDF, I get a newline, the string "8000", and another trailing newline... then the binary picks up again. Not sure what that is. Looks like there are 10 instances of this (so that accounts for another 50 of the extra bytes).

Does anyone have any idea what could be going on here?

I have the following code:

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;
use Date::Parse;
use DateTime;
use File::Path;

########################################################################################################################
#                Change only the configuration settings in this section, nothing above or below it.                    #
########################################################################################################################

# Credentials
my $username = "someusername";
my $password = "somepassword";

# Enclose value in double quotes, folders with spaces in the name are ok.
my $root_folder = "/Users/john/Documents/Important/Credit Card Statements";

########################################################################################################################
########################################################################################################################

# Suddenly web robot.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Mac Safari');

# First we have to log in.
$mech->get("https://www.bankofamerica.com/");

# Login, blah.
$mech->submit_form(
  form_name => 'frmSignIn',
  fields  => { Access_ID => $username },
);

# Dumb thing uses a meta refresh...
$mech->follow_link(url_regex => qr/signOn\.go/);

# This is what they call two factor authentication. Heh.
$mech->submit_form(
  form_name => 'ConfirmSitekeyForm',
  fields  => { password => $password },
);

# Just the single account for now... maybe make this a loop later?
#for my $link ($mech->find_all_links(url_regex => qr/redirect\.go.+?target=acctDetails/)) {
$mech->follow_link(url_regex => qr/redirect\.go.+?target=acctDetails/);

# We need the last four digits, easiest here.
my ($fourdigits) = $mech->content() =~ /<span class="bold TL_NPI_AcctName">.+? - (\d{4})</;

# Go to the account details page... 
$mech->follow_link(url_regex => qr/redirect\.go.+?target=statements/);

# Now we need to select which documents we want...
# I'm assuming that you're running this daily in cron. Therefore, we're only going to search the last 60 days.
my $mech2 = $mech->clone();

$mech2->submit_form(
  form_name => 'statementsAndDocTab',
  fields  => { docItemSelected   => 'All',
               dateRangeSelected => '60D',
               selectedDocCode   => 'All',
               selectedDateRange => '60D',
             },
);

# These are nasty javascripty links. I think I have to post to this damn thing, to get a pdf response back. Need to
# regex-loop.
my $page = $mech2->content();
while ($page =~ /id="hidden-documentId\d+" value="(\d+)" name="statement-name".+?onclick="docInboxModuleAccountSkin.downloadLayerSubmit\(this,'downloadPdf','(.+?)', '(.+?)','([0-9\/]+)','(.+?)'/gs) {
    my $documentId = $1;
    my $actionurl = "https://secure.bankofamerica.com" . $2 . "&nocache=" . sprintf("%05d", int(rand(100000)));
    my $docName = $3;
    my $boadate = $4;
    my $documentTypeId = $5;
    # Parse the date once and reuse the object for both the year and the Y-M-D string.
    my $dt = DateTime->from_epoch(epoch => str2time($boadate));
    my $year = $dt->year;
    my $date = $dt->ymd;

    # There are more than just statements here. What do we name the files?
    my $filename;
    if    ($docName =~ m/Change in Terms/i) { $filename = "$date Change in Terms.pdf"; }
    elsif ($docName =~ m/Statement/i)       { $filename = "$date Statement.pdf"; }
    else                                    { $filename = "$date Unknown.pdf"; }

    # We may need to create a folder for the year...
    File::Path::make_path("$root_folder/Bank of America - $fourdigits/$year");

    # Get the file.
    unless (-f "$root_folder/Bank of America - $fourdigits/$year/$filename") {
        my $pdf = $mech2->clone();
        # Normally we'd just do $pdf->get(), but we need to do a submit_form. Unfortunately, the form doesn't exist,
        # javascript creates it in place. Ugh.
        $pdf->post( $actionurl,
         #           ':content_file' => "$root_folder/Bank of America - $fourdigits/$year/$filename",
                    [ documentId     => $documentId,
                      menu           => 'downloadPdf',
                      viewDownload   => 'downloadPdf',
                      date           => $boadate,
                      docName        => $docName,
                      documentTypeId => $documentTypeId,
                      version        => '',
                    ],
        );

        $pdf->save_content("$root_folder/Bank of America - $fourdigits/$year/$filename");

        # Let's do a notification...
        #system("/usr/local/bin/terminal-notifier -message \"Bank of America document dated $date has been downloaded.\" -title \"Statement Retrieved\" ");
    }
}

Upvotes: 3

Views: 1059

Answers (1)

Sobrique

Reputation: 53478

From a quick look at the save_content method in the WWW::Mechanize documentation, the thing that might be worth trying is:

$mech->save_content( $filename, binary => 1 );

The problem you describe resembles the corruption you get when binary data is saved in ASCII (text) mode.
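To make the binary-mode idea concrete, here is a minimal sketch that writes a byte string through Perl's :raw layer and reads it back to confirm nothing was altered. The helper name save_bytes, the temporary file, and the stand-in data are all assumptions for illustration; none of them come from WWW::Mechanize or from the script in the question, and whether binary mode resolves the corruption described there is not confirmed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of WWW::Mechanize): write a byte string to
# disk through the :raw layer so no newline or encoding translation occurs.
sub save_bytes {
    my ($path, $bytes) = @_;
    open(my $fh, '>:raw', $path) or die "Cannot open $path: $!";
    print {$fh} $bytes or die "Cannot write $path: $!";
    close($fh) or die "Cannot close $path: $!";
    return -s $path;    # size on disk, for a quick sanity check
}

# Stand-in for a PDF body (assumption): a header, every byte value once,
# and a CRLF that a text-mode layer could rewrite.
my $bytes = "%PDF-1.4\n" . join('', map { chr } 0 .. 255) . "\r\n%%EOF\n";

# Temporary file, removed automatically when the script exits.
my ($tmp_fh, $path) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
close($tmp_fh);

my $written = save_bytes($path, $bytes);

# Read the file back, also through :raw, and compare with the original bytes.
open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
my $readback = do { local $/; <$in> };
close($in);

print "bytes in memory: ", length($bytes), ", bytes on disk: $written\n";
print $readback eq $bytes ? "round trip intact\n" : "round trip altered\n";
```

In the script from the question, the same helper could be fed the response body instead of the stand-in data, for example save_bytes($file, $pdf->response->decoded_content(charset => 'none')). decoded_content is the HTTP::Message method that undoes any Content-Encoding, and charset => 'none' tells it to skip character-set decoding so the bytes stay untouched. Comparing the size it reports against the copy downloaded in Chrome is a quick way to see whether the extra bytes are gone.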

Upvotes: 1
