Reputation: 31
I have a PDF document that I would like to extract content from. When I search for the IMEI keyword the script finds it, but I need the actual IMEI value, which is the next item in the loop.
In the PDF the value looks like this: IMEI 90289393092
The script below returns this instead: -0.1 -8.8 9.8 -0.1 446.7 403.9 Tm (IMEI:) Tj
I only want the value: 90289393092
Script I am using:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList "$pwd\PDF\DOC001.pdf"
for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
    $lines = [char[]]$reader.GetPageContent($page) -join "" -split "`n"
    foreach ($line in $lines) {
        if ($line -match "IMEI") {
            $line = $line -replace "\\([\S])", $matches[1]
            $line -replace "^\[\(|\)\]TJ$", "" -split "\)\-?\d+\.?\d*\(" -join ""
        }
    }
}
Upvotes: 2
Views: 12606
Reputation: 60938
This is how to use itextsharp.dll to read a PDF as plain text:
Add-Type -Path .\itextsharp.dll
$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList c:\ps\a.pdf
for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    $strategy = new-object 'iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy'
    $currentText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page, $strategy);
    [string[]]$Text += [system.text.Encoding]::UTF8.GetString([System.Text.ASCIIEncoding]::Convert( [system.text.encoding]::default , [system.text.encoding]::UTF8, [system.text.Encoding]::Default.GetBytes($currentText)));
}
$Reader.Close();
And this may be the regex you need, though I haven't tested it:
[regex]::matches( $text, '(?<=IMEI\s+)(\d+)(?=\s+)' ) | select -expa value
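Since the regex above is untested, here is a minimal sketch that tries a slightly more tolerant variant on a sample string, with no PDF involved. The sample text is hypothetical and only stands in for whatever the extraction returns. The variant assumes the label may appear as either "IMEI" or "IMEI:" (the raw content in the question shows a colon) and that the number may be the last thing in the text, where a lookahead requiring trailing whitespace would not match.

```powershell
# Hypothetical sample standing in for the extracted page text;
# the real output of GetTextFromPage may be laid out differently.
$sample = "Model: X100`nIMEI: 90289393092`nSerial: ABC123"

# The lookbehind accepts "IMEI" with an optional colon and any whitespace.
# \b ends the match at the last digit, so it also works at the end of input.
$pattern = '(?<=IMEI:?\s*)\d+\b'

$imei = [regex]::Matches($sample, $pattern) | Select-Object -ExpandProperty Value
$imei
```

If the document can contain several IMEI entries, the same call returns every match, so `$imei` becomes an array rather than a single string.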
Upvotes: 4