Reputation: 65
I've been using the CAM::PDF module to edit PDF documents at work. Essentially, I just want to change the date on each document automatically to show that it has been reviewed recently.
Unfortunately, the changes never appear in the output PDFs, even though my code reports that it is changing PDF objects ($pdf->{changes}) and I have given the PDFs it modifies the most permissive access (anyone can read and write). I have also been grepping the object node tmp files that I dump en masse, and none of them show any sign of the old date after the code runs; yet when I view the output PDF, the old date is still there. Has anyone encountered this before, or can anyone suggest anything?
Doing this manually isn't an option. I want a script that I can run against multiple files at once (I have LOTS of these files to sort out at work). Apart from the dates written on the document, the document has to keep roughly the same look (a small change in file size would be fine, but a completely different appearance would not).
For my code I strictly followed the example changepdfstring.pl (https://metacpan.org/pod/distribution/CAM-PDF/bin/changepdfstring.pl) from the author of CAM::PDF, then tried different variations of it to get things to work, so I'm bemused that nothing has worked in the end.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use Data::Dumper;
my $pdf = CAM::PDF->new('Order fulfilment process flowchart.pdf');
if (!$pdf->canModify())
{
die "This PDF forbids modification\n";
}
my $olddate = "15.02.2019";
my $newdate = "22.02.2022";
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$pdf->changeString($objectnode, {$olddate=>$newdate});
}
my $change = $pdf->{changes};
print Dumper($change);
my $count = 0;
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$count++;
open(my $ono, '>', "tmp.objectnode.$count") or die "can't open tmp.objectnode.$count: $!";
print {$ono} Dumper($objectnode);
close($ono);
}
if (!scalar %{$pdf->{changes}})
{
die "no changes were made :(";
}
$pdf->preserveOrder();
$pdf->cleanoutput('pleasework.pdf');
Any help or advice would be greatly appreciated
Upvotes: 1
Views: 464
Reputation: 65
I found that the text I was trying to edit was not actually stored as a contiguous run of characters in the PDF. Instead, it was split across the operand array of a TJ operator inside a BT ... ET text block. I cannot see any provision in the CAM::PDF library for handling text held in TJ lines (although perhaps there is, @ChrisDolan?), so CAM::PDF could neither operate on it nor swap it out. After decompressing all the streams (where applicable), I found this TJ line, which held the text I wished to change:
[(D)-20(a)24(t)62(e)-46(:)86( )-46(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ
I don't believe CAM::PDF could have acted on TJ lines; perhaps it can only act on Tj lines.
For anyone looking for a quick answer to this same problem, this "dirty" script worked for me in this case:
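To make the TJ point concrete, here is a minimal sketch in plain Perl (no CAM::PDF involved). The helper name tj_text is hypothetical, and it assumes simple literal strings with no hex strings or octal escapes. It joins the string pieces of the TJ array and drops the kerning numbers, which shows that the rendered text contains the date even though the raw stream never contains it as one substring:

```perl
use strict;
use warnings;

# Hypothetical helper: concatenate the (...) literal strings of a TJ array,
# ignoring the kerning adjustments that sit between them.
sub tj_text {
    my ($tj) = @_;
    my $text = '';
    # Each match is one literal string; backslash escapes are allowed inside it.
    while ($tj =~ /\(((?:\\.|[^\\()])*)\)/g) {
        my $piece = $1;
        $piece =~ s/\\([()\\])/$1/g;    # undo \( \) and \\ escapes
        $text .= $piece;
    }
    return $text;
}

my $line = '[(D)-20(a)24(t)62(e)-46(:)86( )-46(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ';

print "rendered text: ", tj_text($line), "\n";
print "old date is ", (index($line, '15.02.2019') < 0 ? "not " : ""),
      "a contiguous substring of the raw line\n";
```

A plain substring search for the date over the raw content will therefore miss it, which is consistent with the behaviour described in the question.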
#!/usr/bin/perl
use strict;
use Compress::Raw::Zlib;
use bytes;
open(OUT, '>:raw', "newfromoldscript.pdf") || die("can't open(newfromoldscript.pdf): $!"); # binary-safe output
my $fname = 'Order fulfilment process flowchart.pdf';
open(FILE, '<:raw', $fname) || die("can't open($fname): $!");
$/ = undef;
my $file = <FILE>;
my $file_len = length($file);
my $i = 0;
my $offset = 0; # byte offset of the next unprocessed chunk of the file
my $o;
do {
$o = doX(substr($file, $offset, $file_len), $i);
$offset+=$o;
$i++;
} while($o && $i< 100);
sub doX {
my $file = shift;
my $i = shift;
my $stream = index($file, "\nstream");
if ($stream < 0) {
print OUT $file;
return 0;
}
$stream++;
my $deflate = 1;
my $line_before = rindex(substr($file,0,$stream), "<<");
print OUT substr($file,0,$line_before);
my $x = substr($file, $line_before,$stream-$line_before);
if ($i == 22) {
print "";
}
my $stream_len;
if ($x =~ /FlateDecode\/Length (\d+)>>/) {
$stream_len = $1;
}
if ($x =~ /FlateDecode\/Length (\d+)\//) {
print "Warn Object $i has len/len what the even is this?\n";
$stream_len = $1;
}
if ($x =~ /XML\/Length (\d+)>>/) {
$deflate = 0;
$stream_len = $1;
}
if (!$stream_len) {
die("I fail with no stream len : $x");
}
print "-->$line_before,$i,$stream=$stream_len=$x<--\n";
my $bytes = substr($file, $stream+8,$stream_len);
my $orig_bytes = $bytes; # inflate seems to mangle bytes, so take a copy
my $o;
my $d=new Compress::Raw::Zlib::Inflate();
if ($deflate) {
$d->inflate($bytes,$o);
} else {
$o = $bytes;
}
my $orig_x = $x;
my $changes;
my %change = (
'-20(2)52(0)-37(.)52(.)' => '-20(2)52(0)-37(2)52(0)', #trialling different reg ex's here
'-37(1)52(9)'=>'-37(2)52(0)', #reg ex's
'Date: 15.02.2019'=>'Date: 12.02.2020',
'[(A)[\d-]+(p)[\d-]+(p)[\d-]+(r)[\d-]+(o)[\d-]+(ve)[\d-]+(d)[\d-]+( )[\d-]+(B[^\]]+\] TJ'=>'(Approved By: George W) Tj??G-TAG??' #scrap the whole TJ, replace for Tj
);
foreach my $re (keys %change) {
my $to = $change{$re};
$re =~ s/([\(\)])/\\$1/g; # escape round brackets
print $re;
open (GW, ">tmp.gw");
print GW $re;
close (GW);
if ($o=~/$re/m) {
$o =~ s/$re/$to/mg;
print $o;
$changes++;
}
}
if ($changes) {
print "\n MADE CHANGES\n";
#split, get rid of the ? mark tag
my @remains = split('\?\?G-TAG\?\?', $o);
my $firsthalf = $remains[0];
my $secondhalf = $remains[1];
#reverse the string
$firsthalf = scalar reverse ($firsthalf);
if ($firsthalf =~ m/fT 52\.8 2F/){print "FOUND THE REVERSE"}
$firsthalf =~ s/fT 52\.8 2F/fT 52\.8 0F/;
#reg ex to back track to the nearest and thus relevant Font/F and set it to F0
#put it back in correct orientation
$firsthalf = scalar reverse ($firsthalf);
$o = join("", $firsthalf, $secondhalf);
open (WEIRD, ">tmp.weird");
print WEIRD $firsthalf;
close (WEIRD);
$changes++;
my $d = new Compress::Raw::Zlib::Deflate();
my $obytes;
my $obytes2;
my $status = $d->deflate($o, $obytes);
$d->flush($obytes2);
$bytes = $obytes . $obytes2;
if (length($bytes) != $stream_len) {
my $l = length($bytes);
print "-->$x<--\n";
warn("what do we do here $l != $stream_len");
$orig_x =~ s/$stream_len/$l/;
}
print OUT $orig_x . "stream\r\n";
print OUT $bytes . "\r";
} else {
print OUT $orig_x . "stream\r\n";
print OUT $orig_bytes . "\r";
}
open(TMP,">out/tmp.$i.bytes");
print TMP $o;
close(TMP);
return $stream + 8 + $stream_len + 1;
}
Essentially, to change someone else's name on the document to mine, I swap the whole TJ line for a Tj line, which makes it simpler to insert my change (but potentially messy). To get this to display with capital letters, I had to change the font it was set in from F2 to F0; to find the nearest preceding font setting, I reverse the string, substitute the first match, and reverse it back.
For the TJ line holding the date, I replaced the individual characters inside the TJ array with those of the new date, which meant I had to stick to the "unfriendly" syntax that TJ operator lines use.
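As a rough sketch of that second approach (replacing characters inside the TJ array while leaving the kerning numbers alone), something like the following could work. The helper name tj_swap_text is hypothetical. It assumes that every string piece in the array holds exactly one unescaped character and that the old and new text have the same length; in any other case it returns the line unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: swap $old for $new inside a TJ array, one character
# per string piece, keeping every kerning number where it was.
sub tj_swap_text {
    my ($tj, $old, $new) = @_;
    return $tj if length($old) != length($new) or $new =~ /[()\\]/;

    # With a capturing split, odd-numbered tokens are the (...) strings and
    # even-numbered tokens are whatever lies between them.
    my @tokens = split /(\((?:\\.|[^\\()])*\))/, $tj;
    my @idx    = grep { $_ % 2 } 0 .. $#tokens;

    # Bail out unless every string piece is a single plain character.
    return $tj if grep { $tokens[$_] !~ /^\([^\\()]\)$/ } @idx;

    my $text = join '', map { substr($tokens[$_], 1, 1) } @idx;
    my $pos  = index($text, $old);
    return $tj if $pos < 0;

    for my $k (0 .. length($new) - 1) {
        $tokens[ $idx[$pos + $k] ] = '(' . substr($new, $k, 1) . ')';
    }
    return join '', @tokens;
}

my $old_line = '[(D)-20(a)24(t)62(e)-46(:)86( )-46(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ';
print tj_swap_text($old_line, '15.02.2019', '22.02.2022'), "\n";
```

Because the kerning values were tuned for the original glyphs, the spacing of the new digits may look slightly off; that is the trade-off for not re-typesetting the line.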
Upvotes: 0
Reputation: 8963
I'm the author of CAM::PDF. Without seeing the PDF, I can only guess, but I'd wager that the problem is that $olddate simply doesn't match any text in the doc. Kerning can break strings into multiple parts, for example. Also, there are several different ways to encode strings that appear the same in the resulting doc. So the trick for you will be figuring out what the pattern for the dates is in your specific docs.
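One way to look for such a pattern, sketched here under assumptions, is to build a regex from the date that tolerates a kerning adjustment between any two characters. The helper name kerned_regex is hypothetical, and it only covers literal strings split by plain numbers of the form `(1)52(5)`; it ignores hex strings and other encodings.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches $text either as one
# contiguous run or split by kerning numbers, e.g. "(1)52(5)-37(.)".
sub kerned_regex {
    my ($text) = @_;
    # Optional gap: close the string, a number, open the next string.
    my $gap  = qr/(?:\)\s*-?\d+(?:\.\d+)?\s*\()?/;
    my $body = join $gap, map { quotemeta } split //, $text;
    return qr/$body/;
}

my $olddate = '15.02.2019';
my $re      = kerned_regex($olddate);

for my $content (
    '(Date: 15.02.2019) Tj',
    '[(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ',
) {
    print(($content =~ $re ? "match:    " : "no match: "), $content, "\n");
}
```

Running such a pattern over the decompressed page content would at least show whether the date is present in a kerned form before attempting any replacement.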
That said, I also like the clever idea that @Bruce Ramos offered in a separate answer. That approach won't change a date that's visible in the rendered PDF (like, if you print it) but it should show up as metadata in nearly any PDF viewer.
Upvotes: 3
Reputation: 146
A quick look at page 145 of the PDF specification[1] shows that there are two metadata fields that should allow a simple change to achieve what you are trying to do.
Below is a quick script that uses CAM::PDF to set or modify the ModDate field to the current date, thus giving the illusion of "modifying" the PDF.
If needed, the script can be amended to set the modification date to a specific date instead of the current time.
Please note that I'm not sure CAM::PDF is the best option for this task.
The script is only a sample of what can be done within the limitations and simplicity of CAM::PDF.
[1] https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Local;
use CAM::PDF;
use CAM::PDF::Node;
my $infile = shift || die 'syntax...';
my $outfile = shift || die 'syntax...';
my $pdf = CAM::PDF->new($infile) || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info});
if ($info) {
my @time = localtime(time);
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = @time;
$year += 1900;
$mon++;
my $gmt_offset_in_seconds = timegm(@time) - timelocal(@time);
my $gmt_offset_min = ($gmt_offset_in_seconds / 60) % 60;
my $gmt_offset_hour = abs(int($gmt_offset_in_seconds / (60*60)));
my $offset_char = "";
if ($gmt_offset_in_seconds < 0) {
$offset_char = "-";
} else {
$offset_char = "+";
}
my $date = sprintf("D:%04d%02d%02d%02d%02d%02d%s%02d'%02d'", $year, $mon, $mday, $hour, $min, $sec, $offset_char, $gmt_offset_hour, $gmt_offset_min);
my $objnum = undef;
for my $obj ('Creator', 'Producer', 'CreationDate') {
if (exists $info->{$obj} and exists $info->{$obj}->{objnum}) {
$objnum = $info->{$obj}->{objnum};
last;
}
}
die "Cannot find objnum, halting..." if not defined $objnum;
my $mod_date = $info->{ModDate};
if ($mod_date) {
$mod_date->{value} = $date;
} else {
my $mod_date = new CAM::PDF::Node('string',$date);
$mod_date->{gennum} = 0;
$mod_date->{objnum} = $objnum;
$info->{ModDate} = $mod_date;
}
$pdf->preserveOrder();
$pdf->cleanoutput($outfile);
} else {
print "Cannot find PDF info section, doing nothing!\n";
}
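For the "specific date" variant mentioned above, a minimal sketch of building the date string could look like this. The helper name pdf_date_utc is hypothetical, and it assumes midnight UTC is acceptable for the timestamp; the resulting string would take the place of $date in the script.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: format a calendar date as a PDF date string at
# midnight UTC, in the same D:YYYYMMDDHHmmSS+HH'mm' layout used above.
sub pdf_date_utc {
    my ($year, $mon, $mday) = @_;
    # timegm croaks on an impossible date such as 31 February.
    my $epoch = timegm(0, 0, 0, $mday, $mon - 1, $year);
    my @t     = gmtime $epoch;
    return sprintf "D:%04d%02d%02d%02d%02d%02d+00'00'",
        $t[5] + 1900, $t[4] + 1, @t[3, 2, 1, 0];
}

print pdf_date_utc(2022, 2, 22), "\n";
```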
Upvotes: 3