fodelement

Reputation: 11

Extract Pages from a PDF using iTextSharp in PowerShell

I have been researching this for weeks and can't seem to make much headway. I have a large PDF (900+ pages) that is the result of a mail merge: 900+ copies of the same one-page document, with the only difference being someone's name at the bottom. What I am trying to do is have a PowerShell script read the document using iTextSharp and save the pages that contain a specific string (the person's name) into their respective folders.

This is what I have managed so far.

Add-Type -Path C:\scripts\itextsharp.dll

$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "$pwd\downloads\TMs.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
    # Extract the text of the page and split it into lines
    $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)

    # -match on an array returns the matching lines, so this is true if any line contains the name
    if ($pageText -match 'DAN KAGAN') {
        Write-Host "DAN FOUND"
    }
}

As you can see, I am only using one name for now for testing. The script correctly finds the name 10 times. What I cannot find any information on is how to extract the pages on which this string appears.

I hope this was clear. If I can provide any more information, please let me know.

Thanks!

Upvotes: 1

Views: 7395

Answers (1)

Bacon Bits

Reputation: 32180

I actually just finished writing a very similar script. With my script, I need to scan a PDF of report cards, find a student's name and ID number, and then extract that page and name it appropriately. However, each report card can span multiple pages.

It looks like you're using iTextSharp 5, which is good because so am I. The API of its successor, iText 7, is wildly different and I haven't learned it yet.

Here's the logic that does the page extraction, roughly:

    $Document = [iTextSharp.text.Document]::new($PdfReader.GetPageSizeWithRotation($StartPage))
    $TargetMemoryStream = [System.IO.MemoryStream]::new()
    $PdfCopy = [iTextSharp.text.pdf.PdfSmartCopy]::new($Document, $TargetMemoryStream)

    $Document.Open()
    foreach ($Page in $StartPage..$EndPage) {
        $PdfCopy.AddPage($PdfCopy.GetImportedPage($PdfReader, $Page));
    }
    $Document.Close()

    $NewFileName = 'Elementary Student Record - {0}.pdf' -f $Current.Student_Id
    $NewFileFullName = [System.IO.Path]::Combine($OutputFolder, $NewFileName)
    [System.IO.File]::WriteAllBytes($NewFileFullName, $TargetMemoryStream.ToArray())
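
The snippet above takes $StartPage and $EndPage as given. As a rough sketch of how those ranges can be worked out when only the first page of each record is known, the following uses plain PowerShell and made-up page numbers. Get-PageRange is a hypothetical helper name, not part of iTextSharp; the idea is to walk the start pages from last to first, so each record ends on the page before the next one begins.

```powershell
# Hypothetical helper (not part of iTextSharp): turns a list of start pages
# into start/end ranges by walking backwards from the last page.
function Get-PageRange {
    param(
        [int[]]$StartPages,   # page on which each record begins, in ascending order
        [int]$TotalPages      # total number of pages in the source PDF
    )

    $Stack = [System.Collections.Stack]::new()
    foreach ($Page in $StartPages) { $Stack.Push($Page) }

    $LastPage = $TotalPages
    while ($Stack.Count -gt 0) {
        $StartPage = $Stack.Pop()
        [PSCustomObject]@{ StartPage = $StartPage; EndPage = $LastPage }
        # The previous record ends on the page before this one starts
        $LastPage = $StartPage - 1
    }
}

# Made-up example: records start on pages 1, 2 and 5 of a 6-page PDF.
# Yields the ranges 5-6, 2-4 and 1-1 (last record first).
Get-PageRange -StartPages 1, 2, 5 -TotalPages 6
```

When every record is exactly one page, as in the question, each range collapses to StartPage = EndPage.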

Here is the complete working script. I've removed as little as possible to give you a near-working example:

Import-Module -Name SqlServer -Cmdlet Invoke-Sqlcmd
Add-Type -Path 'C:\...\itextsharp.dll'

# Get table of valid student IDs
$ServerInstance = '...'
$Database = '...'
$Query = @'
select student_id, student_name from student
'@
$ValidStudents = @{}
Invoke-Sqlcmd -Query $Query -ServerInstance $ServerInstance -Database $Database -OutputAs DataRows | ForEach-Object {
    [void]$ValidStudents.Add($_.student_id.trim(), $_.student_name)
}

$PdfFiles = Get-ChildItem "G:\....\*.pdf" -File |
    Select-Object -ExpandProperty FullName
$OutputFolder = 'G:\...'

$StudentIDSearchPattern = '(?mn)^(?<Student_Id>\d{6,7}) - (?<Student_Name>.*)$'
foreach ($PdfFile in $PdfFiles) {
    $PdfReader = [iTextSharp.text.pdf.PdfReader]::new($PdfFile)

    $StudentStack = [System.Collections.Stack]::new()

    # Map out the PDF file.
    foreach ($Page in 1..($PdfReader.NumberOfPages)) {
        [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($PdfReader, $Page) |
            Where-Object { $_ -match $StudentIDSearchPattern } |
            ForEach-Object {
            $StudentStack.Push([PSCustomObject]@{
                    Student_Id   = $Matches['Student_Id']
                    Student_Name = $Matches['Student_Name']
                    StartPage    = $Page
                    IsValid      = $ValidStudents.ContainsKey($Matches['Student_Id'])
                })
        }
    }

    # Extract the pages and save the files
    $LastPage = $PdfReader.NumberOfPages
    while ($StudentStack.Count -gt 0) {
        $Current = $StudentStack.Pop()

        $StartPage = $Current.StartPage
        $EndPage = $LastPage

        $Document = [iTextSharp.text.Document]::new($PdfReader.GetPageSizeWithRotation($StartPage))
        $TargetMemoryStream = [System.IO.MemoryStream]::new()
        $PdfCopy = [iTextSharp.text.pdf.PdfSmartCopy]::new($Document, $TargetMemoryStream)

        $Document.Open()
        foreach ($Page in $StartPage..$EndPage) {
            $PdfCopy.AddPage($PdfCopy.GetImportedPage($PdfReader, $Page));
        }
        $Document.Close()

        $NewFileName = 'Elementary Student Record - {0}.pdf' -f $Current.Student_Id
        $NewFileFullName = [System.IO.Path]::Combine($OutputFolder, $NewFileName)
        [System.IO.File]::WriteAllBytes($NewFileFullName, $TargetMemoryStream.ToArray())

        $LastPage = $Current.StartPage - 1
    }

    # Release the file handle on the source PDF
    $PdfReader.Close()
}

In my test environment this processes about 500 students across 5 source PDFs in about 15 seconds.

I tend to use the [Type]::new() constructor syntax (available in PowerShell 5.0 and later) instead of New-Object, but there's no real difference between the two. I just find constructors easier to read.
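
For the simpler case in the question, where every page is a complete document, the same copy logic can be applied one page at a time. This is an untested sketch rather than a drop-in solution: Export-PdfPagesByName, its parameters and the 'Page N.pdf' file naming are all hypothetical, and it assumes iTextSharp 5 has already been loaded with Add-Type and that each name is usable as a folder name.

```powershell
# Sketch only: assumes iTextSharp 5 is already loaded with Add-Type.
# Export-PdfPagesByName and its parameters are hypothetical names.
function Export-PdfPagesByName {
    param(
        [string]$SourcePdf,     # full path to the merged PDF
        [string[]]$Names,       # names to search for, e.g. 'DAN KAGAN'
        [string]$OutputRoot     # one subfolder per name is created under this folder
    )

    $Reader = [iTextSharp.text.pdf.PdfReader]::new($SourcePdf)
    try {
        foreach ($Page in 1..($Reader.NumberOfPages)) {
            $Text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($Reader, $Page)

            foreach ($Name in $Names) {
                # Literal, case-sensitive substring test
                if (-not $Text.Contains($Name)) { continue }

                # Assumes the name contains no characters that are invalid in a path
                $Folder = Join-Path $OutputRoot $Name
                [void][System.IO.Directory]::CreateDirectory($Folder)

                # Copy just this one page into a new in-memory PDF
                $Document = [iTextSharp.text.Document]::new($Reader.GetPageSizeWithRotation($Page))
                $Stream = [System.IO.MemoryStream]::new()
                $Copy = [iTextSharp.text.pdf.PdfSmartCopy]::new($Document, $Stream)
                $Document.Open()
                $Copy.AddPage($Copy.GetImportedPage($Reader, $Page))
                $Document.Close()

                $Target = Join-Path $Folder ('Page {0}.pdf' -f $Page)
                [System.IO.File]::WriteAllBytes($Target, $Stream.ToArray())
            }
        }
    }
    finally {
        # Always release the file handle on the source PDF
        $Reader.Close()
    }
}
```

Including the page number in the file name keeps the ten pages that match the same name from overwriting each other.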

Upvotes: 4
