Parsing PDF by line

Question

I've been able to parse a PDF page by page in several ways; the latest is this (not my code):

$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "oldy.pdf"

for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    # A fresh strategy per page, so text from earlier pages is not carried over
    $strategy = New-Object iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)
    # Round-trip the page text through the system default encoding into UTF-8, then append it as one array element
    [string[]]$Text += [System.Text.Encoding]::UTF8.GetString([System.Text.Encoding]::Convert([System.Text.Encoding]::Default, [System.Text.Encoding]::UTF8, [System.Text.Encoding]::Default.GetBytes($currentText)))
}

$reader.Close()

I found a post here that suggested using LocationTextExtractionStrategy instead and splitting each line out by ' '. However, I will admit that the .NET code here is confusing me, and I'm not sure how to modify it to parse by line.

Can anyone help?

Thanks.
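
For reference, the LocationTextExtractionStrategy idea mentioned above might look roughly like the sketch below. It assumes iTextSharp 5.x is already loaded into the session (for example with Add-Type -Path pointing at wherever itextsharp.dll lives), and that the strategy separates lines in its result with newline characters. The function names Split-TextIntoLines and Get-PdfLines are made up for illustration; they are not part of iTextSharp.

```powershell
# Hypothetical helper: split a block of extracted text into individual lines.
# Accepts both Windows (CRLF) and Unix (LF) line endings.
function Split-TextIntoLines {
    param([string]$Text)
    $Text -split '\r?\n'
}

# Hypothetical helper: return every line of every page as one flat list of strings.
# Assumes the iTextSharp 5.x assembly has already been loaded.
function Get-PdfLines {
    param([string]$Path)

    # .NET does not follow PowerShell's current location, so resolve to a full path first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $fullPath
    try {
        for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
            # New strategy per page; this one orders text chunks by their position on the page
            $strategy = New-Object iTextSharp.text.pdf.parser.LocationTextExtractionStrategy
            $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy)
            Split-TextIntoLines $pageText
        }
    }
    finally {
        # Release the file handle even if extraction fails part-way
        $reader.Close()
    }
}
```

With the assembly loaded, something like `$lines = @(Get-PdfLines "oldy.pdf")` should then give one array element per line rather than one per page, so `$lines[0]` would be the first line of the first page. How cleanly the lines come out depends on how the PDF itself positions its text.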
