Reputation: 2868
I have a process that compresses PDF files that our secretaries create by scanning signed documents at a multi-function printer.
On rare occasions, these files cannot be opened in Acrobat reader after being compressed. I don't know why this is happening rarely, so I'd like to be able to test the PDF post-compression and see if it is "good".
I am trying to use itextsharp 5.1.1 to accomplish this, but it happily loads the PDF. My best guess is that Acrobat reader fails when it's trying to display the picture.
Any ideas on how I can tell if the PDF will render?
Upvotes: 9
Views: 12358
Reputation: 1
PdfCpu works great. relaxed example:
pdfcpu validate goggles.pdf
Strict example:
pdfcpu validate -m strict goggles.pdf
https://pdfcpu.io/core/validate
Upvotes: 2
Reputation: 1692
qpdf will be of great help for your needs:
apt-get install qpdf
qpdf --check filename.pdf
example output:
checking filename.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
WARNING: filename.pdf: file is damaged
WARNING: filename.pdf (object 185 0, file position 1235875): expected n n obj
WARNING: filename.pdf: Attempting to reconstruct cross-reference table
WARNING: filename.pdf: object 185 0 not found in file after regenerating cross reference table
operation for Dictionary object attempted on object of wrong type
Upvotes: 1
Reputation: 9
I've used "pdfinfo.exe" from xpdfbin-win package and cpdf.exe to check PDF files for corruption, but didn't want to involve a binary if it wasn't necessary.
I read that newer PDF formats have a readable xml data catalog at the end, so I opened the PDF with regular windows NOTEPAD.exe and scrolled down past the unreadable data to the end and saw several readable keys. I only needed one key, but chose to use both CreationDate and ModDate.
The following Powershell (PS) script will check ALL the PDF files in the current directory and output the status of each into a text file (!RESULTS.log). It took about 2 minutes to run this against 35,000 PDF files. I tried to add comments for those who are new to PS. Hope this saves someone some time. There's probably a better way to do this, but this works flawlessly for my purposes and handles errors silently. You might need to define the following at the beginning: $ErrorActionPreference = "SilentlyContinue" if you see errors on screen.
Copy the following into a text file and name it appropriately (ex: CheckPDF.ps1) or open PS and browse to the directory containing the PDF files to check and paste it in the console.
#
# PowerShell v4.0
#
# Get all PDF files in current directory
#
$items = Get-ChildItem | Where-Object {$_.Extension -eq ".pdf"}
$logFile = "!RESULTS.log"
$badCounter = 0
$goodCounter = 0
$msg = "`n`nProcessing " + $items.count + " files... "
Write-Host -nonewline -foregroundcolor Yellow $msg
foreach ($item in $items)
{
#
# Suppress error messages
#
trap { Write-Output "Error trapped"; continue; }
#
# Read raw PDF data
#
$pdfText = Get-Content $item -raw
#
# Find string (near end of PDF file), if BAD file, ptr will be undefined or 0
#
$ptr1 = $pdfText.IndexOf("CreationDate")
$ptr2 = $pdfText.IndexOf("ModDate")
#
# Grab raw dates from file - will ERR if ptr is undefined or 0
#
try { $cDate = $pdfText.SubString($ptr1, 37); $mDate = $pdfText.SubString($ptr2, 31); }
#
# Append filename and bad status to logfile and increment a counter
# catch block is also where you would rename, move, or delete bad files.
#
catch { "*** $item is Broken ***" >> $logFile; $badCounter += 1; continue; }
#
# Append filename and good status to logfile
#
Write-Output "$item - OK" -EA "Stop" >> $logFile
#
# Increment a counter
#
$goodCounter += 1
}
#
# Calculate total
#
$totalCounter = $badCounter + $goodCounter
#
# Append 3 blank lines to end of logfile
#
1..3 | %{ Write-Output "" >> $logFile }
#
# Append statistics to end of logfile
#
Write-Output "Total: $totalCounter / BAD: $badCounter / GOOD: $goodCounter" >> $logFile
Write-Output "DONE!`n`n"
Upvotes: 0
Reputation: 2868
OK, what I ended up doing was using itextsharp to loop through all of the stream objects and check their length. The error condition I had was that the length would be zero. This test seems quite reliable. It may not work for everyone, but it worked in this particular situation.
Upvotes: 4
Reputation: 28110
In similar situations in the past I have successfully used the PDF Toolkit (a/k/a pdftk) to repair bad PDFs with a command like this: pdftk broken.pdf output fixed.pdf
.
Upvotes: 4