Reputation: 140
For the project I am working on, I have thousands of PDF forms that I need to rename based on their contents.
So far, I have run OCR on them and exported the content to text files. Each PDF form has a .txt file of the same name containing all of its information. I would like to use PowerShell (if possible) to extract a specific part of the text file and use it to rename the PDF file, but I am not sure how to do that.
To give a better idea of what I'm working with, the form contained in the PDF and text files (e.g. 12345.pdf and 12345.txt) looks something like this:
~~~
constituency: xxxyyyzzz
polling station: abc def ghi (001)
stream: 123
~~~
What I need to do is extract the polling station name and rename the PDF file to it:
"12345.pdf" -> "abc_def_ghi_(001).pdf"
So I need to figure out how to extract the string between "station:" and "stream:" from 12345.txt. To make things a bit more complicated, the text files I want to extract the string from have some irregular spacing.
For example, the previous form may look like this in the text file:
~~~
constit uency: xxxyyyzzz
polling stat i on: abc de f ghi (00 1)
s tream: 12 3
~~~
Fortunately, the letters themselves seem to be intact.
So, I would like to learn how to extract the string containing the polling station name from these text files and rename the corresponding pdf files with it.
Thanks for your help.
Upvotes: 0
Views: 654
Reputation: 10044
Assuming the text on each "polling station" line always has the same layout (here, three 3-character words followed by a 5-character code), you could remove all of the whitespace, trim off the irrelevant parts, and then rebuild the name with Substring() calls.
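# Sample OCR output with irregular spacing
$text = 'constit uency: xxxyyyzzz
polling stat i on: abc de f ghi (00 1)
s tream: 12 3'
# Strip all whitespace (line breaks included), then drop everything up to and
# including "pollingstation:" and everything from "stream:" onward
$trimmed = $text -replace '\s','' -replace '^.*pollingstation:','' -replace 'stream:.*$',''
# Rebuild the name from fixed-width pieces: 3 + 3 + 3 + 5 characters
"$($trimmed.Substring(0,3))_$($trimmed.Substring(3,3))_$($trimmed.Substring(6,3))_$($trimmed.Substring(9,5)).pdf"
# Output: abc_def_ghi_(001).pdf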
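To apply the same whitespace-stripping idea to every file, something along these lines might work. It is only a rough sketch: the function name `Get-CompactStationName` and the folder `C:\forms` are made up for illustration, and it assumes each .txt file sits next to a PDF with the same base name.

```powershell
# Hypothetical helper: returns the compacted station name, or $null when the labels are missing.
function Get-CompactStationName {
    param([string]$Text)

    # Strip every whitespace character so OCR spacing errors no longer matter.
    $compact = $Text -replace '\s', ''

    # Capture whatever sits between the two labels (-match is case-insensitive).
    if ($compact -match 'pollingstation:(.+?)stream:') {
        # Swap characters that Windows does not allow in file names.
        return ($Matches[1] -replace '[\\/:*?"<>|]', '-')
    }
    return $null
}

# Hypothetical folder holding the PDF/TXT pairs; adjust to your own layout.
$folder = 'C:\forms'

if (Test-Path -LiteralPath $folder) {
    Get-ChildItem -LiteralPath $folder -Filter '*.txt' | ForEach-Object {
        $name = Get-CompactStationName -Text (Get-Content -LiteralPath $_.FullName -Raw)
        $pdf  = Join-Path $_.DirectoryName ($_.BaseName + '.pdf')

        if ($name -and (Test-Path -LiteralPath $pdf)) {
            # -WhatIf only previews the rename; remove it once the output looks right.
            Rename-Item -LiteralPath $pdf -NewName "$name.pdf" -WhatIf
        }
        else {
            Write-Warning "Skipped $($_.Name): no station name or no matching PDF."
        }
    }
}
```

Note that this yields names like abcdefghi(001).pdf without underscores, because the original word boundaries cannot be recovered reliably once the OCR spacing is unreliable. If the layout really is fixed-width, the Substring() step above could be applied to the returned name. Two forms with the same station name would also collide, so the preview output is worth checking first.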
Upvotes: 1
Reputation: 24585
'polling station: abc def ghi (001)' |
Select-String ' station: (.+)' |
ForEach-Object { "{0}.pdf" -f ($_.Matches[0].Groups[1].Value -replace ' ','_') }
# outputs 'abc_def_ghi_(001).pdf'
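The pattern above expects cleanly spaced text. For the OCR output with stray spaces, one possible variation (a sketch only, with `$label` and `$newName` as made-up variable names) is to allow optional whitespace between every character of the label and then strip the whitespace from the captured value:

```powershell
# Build 's\s*t\s*a\s*t\s*i\s*o\s*n\s*:' so stray spaces inside the label still match.
$label = ('station:'.ToCharArray() | ForEach-Object { [regex]::Escape([string]$_) }) -join '\s*'

$newName = 'polling stat i on: abc de f ghi (00 1)' |
    Select-String -Pattern "$label\s*(.+)" |
    ForEach-Object { '{0}.pdf' -f ($_.Matches[0].Groups[1].Value -replace '\s', '') }

$newName
# outputs 'abcdefghi(001).pdf'
```

The underscores are gone here because there is no way to tell a real word break from an OCR artifact in the captured text.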
Upvotes: 1