Reputation: 805
I've been able to parse a PDF by page multiple ways, the latest being this (not my code):
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "oldy.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    $strategy = New-Object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
    [string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default, [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
I found a post here that suggested using LocationTextExtractionStrategy instead and splitting the extracted text into lines on '\n'. However, I will admit that the .NET code there is confusing me, and I'm not sure how to modify it to parse the text line by line.
Can anyone help?
Thanks.
Upvotes: 0
Views: 12974
Reputation: 301
Only a first experiment, but it works as expected:
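# Download iTextSharp from http://sourceforge.net/projects/itextsharp/
# Load the assembly (adjust the path to wherever itextsharp.dll was extracted)
Add-Type -Path itextsharp.dll
# Pass a full path: .NET resolves relative paths against the process working directory, not the PowerShell location
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList (Resolve-Path MyFile.pdf).Path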
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    # extract a page and split it into lines on the line feed character
    $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
    Write-Host "Page $page contains $($text.Length) lines. This is line 5:"
    Write-Host $text[4]
    #foreach ($line in $text)
    #{
    #    any tasks
    #}
}
$reader.Close()
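If you want to try the LocationTextExtractionStrategy mentioned in the question, the same loop can be wrapped up roughly like this. Treat it as a sketch, not tested against your PDF: Split-PageText and Get-PdfLines are names I made up for illustration, and the default DLL path is an assumption. The only iTextSharp calls used are PdfReader, LocationTextExtractionStrategy and PdfTextExtractor.GetTextFromPage.

```powershell
# Hypothetical helper: split one page's extracted text into trimmed, non-empty lines.
function Split-PageText {
    param([string]$PageText)
    # -split takes a regex; \r?\n handles both LF and CRLF line endings.
    $lines = @($PageText -split '\r?\n' |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ -ne '' })
    # The leading comma keeps the result an array even for zero or one line.
    return ,$lines
}

# Hypothetical wrapper: emit one object per line of every page in the PDF.
# Assumes itextsharp.dll (iTextSharp 5.x) is available at $DllPath.
function Get-PdfLines {
    param(
        [string]$PdfPath,
        [string]$DllPath = 'itextsharp.dll'
    )
    Add-Type -Path (Resolve-Path $DllPath).Path
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList (Resolve-Path $PdfPath).Path
    try {
        for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
            # Use a fresh strategy per page so text from earlier pages is not carried over.
            $strategy = New-Object iTextSharp.text.pdf.parser.LocationTextExtractionStrategy
            $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)
            $lineNo = 0
            foreach ($line in (Split-PageText $pageText)) {
                $lineNo++
                [pscustomobject]@{ Page = $page; Line = $lineNo; Text = $line }
            }
        }
    }
    finally {
        # Release the file handle even if extraction fails part-way.
        $reader.Close()
    }
}
```

Usage would be something like: Get-PdfLines -PdfPath MyFile.pdf | Where-Object { $_.Text -match 'Invoice' }. Returning objects with Page and Line properties makes it easy to filter or search afterwards, instead of indexing into an array per page.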
Upvotes: 2