Fnkraf

Reputation: 103

PowerShell: Go through all files (PDFs) in a directory and move them based on what's written in the first 6 bytes

I am currently trying to write a PowerShell script that does the following:

Context: We have around 70k PDF files that can't be opened. After checking them with a certain tool, it looks like around 99% of those are damaged and the remaining 1% are zip files. A zipped PDF file starts with the bytes "PK", while a broken PDF file starts with something like "%PDF-1.4". I need to unzip all zip files and relocate them. Going through 70k PDF files by hand is kind of painful, so I'm looking for a way to automate it.

I know I'm supposed to provide a code sample, but the truth is that I am absolutely lost. I have written a few PowerShell scripts before, but I have no idea how to do something like this.

So, if anyone could kindly point me in the right direction or give me a useful function, I would really appreciate it.

Upvotes: 2

Views: 5265

Answers (2)

Panomosh

Reputation: 892

You can use Get-Content to read the first line of each file, which contains the 6 bytes you asked about. We can then tie that into a loop over all the documents and use a simple if statement to decide what to do next, e.g. move the file to another directory.

EDITED BASED ON YOUR COMMENT:

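$pdfDirectory = 'C:\Temp\struktur_id_1225\ext_dok'
$newLocation = 'C:\Path\To\New\Folder'

Get-ChildItem -Path $pdfDirectory -Filter '*.pdf' | ForEach-Object {
    # Read only the first line of the file and compare it with the expected header
    $firstLine = Get-Content -LiteralPath $_.FullName -TotalCount 1
    if ($firstLine -like '%PDF-1.5*') {
        # Build the path of the matching HL7 file: ext_dok becomes MDM, .pdf becomes .hl7
        $HL7 = $_.FullName.Replace('ext_dok', 'MDM').Replace('.pdf', '.hl7')
        Move-Item -LiteralPath $_.FullName -Destination $newLocation
        Move-Item -LiteralPath $HL7 -Destination $newLocation
    }
}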

Try using the above, which is also a bit easier to edit.

$pdfDirectory will need to be set to the folder containing the PDF files.

$newLocation will obviously be the new directory!

And you will still need to change the -like "%PDF-1.5*" to suit your search!

It should do the rest for you, so give it a shot.

Another Edit

I have mimicked your folder structure on my computer and placed a few PDF files and matching HL7 files in it, and the script is working perfectly.
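
If you would rather check the raw bytes than the first line, and also unpack the files that turn out to be zip archives, a rough sketch could look like the following. The function names Get-FileSignature, Get-PdfKind and Expand-ZippedPdf are made up for illustration and are not part of the script above. It assumes that "PK" at the start means zip and "%PDF-" means a PDF header, as described in the question.

```powershell
# Needed for [System.IO.Compression.ZipFile] on Windows PowerShell
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper: returns the first $Count bytes of a file as an ASCII string.
# Pass a full path, because .NET resolves relative paths against the process
# working directory, which can differ from the current PowerShell location.
function Get-FileSignature {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $Count = 6
    )
    $stream = [System.IO.File]::OpenRead($Path)
    try {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
        return [System.Text.Encoding]::ASCII.GetString($buffer, 0, $read)
    }
    finally {
        $stream.Dispose()
    }
}

# Hypothetical helper: classifies a file as 'zip', 'pdf' or 'unknown' by its first bytes.
function Get-PdfKind {
    param([Parameter(Mandatory)] [string] $Path)
    $signature = Get-FileSignature -Path $Path
    if ($signature.StartsWith('PK')) { 'zip' }
    elseif ($signature.StartsWith('%PDF-')) { 'pdf' }
    else { 'unknown' }
}

# Hypothetical helper: extracts a zip archive regardless of its file extension.
# ExtractToDirectory throws if a file with the same name already exists in $Destination.
function Expand-ZippedPdf {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Destination
    )
    [System.IO.Compression.ZipFile]::ExtractToDirectory($Path, $Destination)
}
```

Inside the ForEach-Object loop you could then call Get-PdfKind -Path $_.FullName and, when it returns 'zip', pass the file to Expand-ZippedPdf with a destination folder of your choice. Reading a handful of bytes through a stream avoids loading whole files, which should matter with 70k documents.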

Upvotes: 2

Tobias KKS

Reputation: 150

Get-Content is not suited for PDFs; you'd want to use iTextSharp to read them.

Download iTextSharp (found in releases) and put itextsharp.dll somewhere easy to find (e.g. the folder your script is located in).

You can install the .nupkg by using Install-Package, or simply use an archive tool to extract the contents of the .nupkg file (it's basically a .zip file).

The code below splits the text of page 1 of each PDF on whitespace and stores the words in an array. You can then test if the array contains your keyword.

Add-Type -Path 'C:\path\to\itextsharp.dll'
$pdfs = Get-ChildItem -Path 'C:\path\to\pdfs' -Filter '*.pdf'

foreach ($pdf in $pdfs) {
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $pdf.FullName

    # Extract the text of page 1 and split it into words on whitespace
    $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, 1) -split '\s+'
    foreach ($word in $text) {
        # do your test here
    }
    # Release the file handle before moving on to the next PDF
    $reader.Close()
}
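
As a small illustration of the "test" step, the snippet below checks a word array for a keyword. The sample sentence and the keyword are invented stand-ins for the text extracted from page 1, so treat this as a sketch rather than part of the script above.

```powershell
# Hypothetical sample data standing in for the words extracted from page 1
$text = 'Discharge summary report for the patient' -split '\s+'

# Hypothetical keyword to look for
$keyword = 'report'

# -contains compares whole array elements and ignores case by default
$found = $text -contains $keyword

if ($found) {
    Write-Output "Keyword '$keyword' found on page 1"
}
```

Since -contains only matches whole words, a partial match would need -like or -match against each element instead.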

Upvotes: 0
