Reputation: 37
I have 7,000 PDF documents in a folder called "ffl". They have all been through an OCR program, so the contents can be copied and pasted.
Each document contains the text "license -----*****". The number is 15 characters long, has dashes, and the 10th character is a letter.
I need a batch job to rename all the files to the license number found in each document.
Is there a script I can run to accomplish this? I have been searching for about a week. Everything I find talks about the new way to rename from Finder, and nothing covers renaming from the contents of a document. I am pretty new to Terminal.
I have seen the basic command for renaming, mv "old location" "new location":
mv /home/user/my_static /home/user/static
Right now I copy the number and paste it as the file name. I need a faster way.
Please and thank you for any advice.
Upvotes: 1
Views: 4673
Reputation: 844
I had a similar problem where I wanted to rename a bunch of PDF files using content extracted from each file (in that case, a date). I first tried a bash-only approach with pdfgrep, but the brew install failed on me (the formula seems to be out of date).
What worked for me is Automator to extract the pdf content to text, and then a quick and dirty script to extract the text and rename. See attached screenshot of the Automator action:
Upvotes: 0
Reputation: 6613
First please install pip:
sudo easy_install pip
or
brew install python
Secondly install pdfminer:
pip install pdfminer
Using pdfminer and Python's standard library, I have created a script specific to your problem:
rename.py
import glob
import os
import re
import subprocess

# License number pattern, e.g. 9-91-053-01-4L-04292
LICENSE_RE = re.compile(r'[0-9]-[0-9]{2}-[0-9]{3}-[0-9]{2}-[0-9][A-Z]-[0-9]{5}')

for pdf in glob.glob("*.pdf"):  # For all files with extension .pdf in this directory
    try:
        # Get the text content of the PDF; passing a list copes with spaces in file names
        pdf_text = subprocess.check_output(['pdf2txt.py', pdf]).decode('utf-8', 'ignore')
    except subprocess.CalledProcessError:
        continue  # pdf2txt.py could not read this file, so skip it
    result = LICENSE_RE.search(pdf_text)  # Find the license number with the regex
    if result:  # If a license number has been found
        new_name = result.group(0) + '.pdf'
        os.rename(pdf, new_name)  # Rename file to LICENSE_NUMBER.pdf
        print('Renamed ' + pdf + ' to ' + new_name)  # Show what has been done
You can execute it by simply typing python rename.py.
This Python script searches its own directory for files with the .pdf extension.
Then it searches each file for a license number, using a regular expression that I wrote for you.
Lastly, if there is a match, it changes the file's name to LICENSE_NUMBER.pdf.
Addition upon OP's comment:
If some of your other PDF documents are formatted slightly differently and this script does not work for them, simply inspect the text contents with:
commands.getstatusoutput('pdf2txt.py ' + file)
For your sample file it was:
...ct ATI- \nCorrespondence To\n\nLicense\nNumber\n\n9-91-053-01-4L-04292\n\nA IF - Chief. FF...
So I have created a regex to find substring \n\nLicense\nNumber\n\n9-91-053-01-4L-04292\n\nA
and get License Number from it. Maybe you can create a more tolerant/general regex for your PDF documents by investigating more samples.
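As one possible starting point for a more tolerant regex, here is a minimal sketch. It assumes the license always has the 1-2-3-2-2-5 grouping seen in the sample and that OCR may leave stray whitespace around the dashes; the names LICENSE_RE and find_license are made up for this illustration.

```python
import re

# Assumed layout: groups of 1, 2, 3, 2, 2 and 5 characters, where the fifth
# group is a digit followed by an uppercase letter. Whitespace (including
# line breaks) around the dashes is tolerated.
LICENSE_RE = re.compile(
    r"(\d)\s*-\s*(\d{2})\s*-\s*(\d{3})\s*-\s*(\d{2})\s*-\s*(\d[A-Z])\s*-\s*(\d{5})"
)

def find_license(text):
    """Return the license number with clean dashes, or None if not found."""
    match = LICENSE_RE.search(text)
    if match is None:
        return None
    return "-".join(match.groups())

sample = "Correspondence To\n\nLicense\nNumber\n\n9-91-053-01-4L-04292\n\nA IF - Chief."
print(find_license(sample))  # prints 9-91-053-01-4L-04292
```

Rebuilding the number from the captured groups means the new file name never contains the stray spaces, even when the PDF text does.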
Upvotes: 1
Reputation: 207465
Updated Answer
Ok, I think we can do a bit better now I understand the format of the number better...
#!/bin/bash
# Don't barf if no files, or if upper or lower case names
shopt -s nullglob nocaseglob
for f in *.pdf; do
   lic=$(pdfgrep "[0-9]-[0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9]-[0-9][A-Z]-[0-9][0-9][0-9][0-9][0-9]" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
   # Check licence is longer than 15 characters, else do nothing
   if [ ${#lic} -gt 15 ]; then
      echo mv "$f" "${lic}.pdf"
   fi
done
If it takes forever, you could also use homebrew to install GNU Parallel so you can do them all in parallel and get the job done faster. So, you would install with:
brew install parallel
and then change the script to do just a single file like this:
#!/bin/bash
if [ $# -ne 1 ]; then
   echo Usage: Renamer file
   exit 1
fi
f="$1"
lic=$(pdfgrep "[0-9]-[0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9]-[0-9][A-Z]-[0-9][0-9][0-9][0-9][0-9]" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
# Check licence is longer than 15 characters, else do nothing
if [ ${#lic} -gt 15 ]; then
   echo mv "$f" "${lic}.pdf"
fi
Then you can get them all done with:
parallel ./Renamer ::: *.pdf
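If you would rather stay in one language, the same fan-out idea can be sketched in Python with a thread pool. This is only an illustration: extract_all and fake_extract are hypothetical names, and the stand-in extractor would need to be replaced by something that actually runs pdfgrep (or pdf2txt.py) on the file.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_all(paths, extract, workers=4):
    """Call extract(path) for every path concurrently; return {path: result}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input paths
        results = list(pool.map(extract, paths))
    return dict(zip(paths, results))

# Stand-in extractor for the demo only; a real one would run an external
# tool on the file and return the license number it finds.
def fake_extract(path):
    return path.upper()

print(extract_all(["a.pdf", "b.pdf"], fake_extract))
```

Threads are enough here because each worker would spend most of its time waiting on an external program rather than running Python code.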
You can extract the license number using pdfgrep, which you can install using homebrew. You would need to go to the homebrew website, copy the one-liner from there (which I don't want to put here in case it gets outdated), paste it into Terminal and run it. Then you can install pdfgrep with:
brew install pdfgrep
Alternatively, you can download and build pdfgrep yourself if you like doing that sort of thing!
You can then extract the licence from a PDF file with:
pdfgrep -i "License Number" SomeFile.pdf | grep -oE "[0-9-]+[A-Z][0-9-]+"
and put that in a variable with:
lic=$(pdfgrep -i "License Number" SomeFile.pdf | grep -oE "[0-9-]+[A-Z][0-9-]+")
So, if you have 7,000 PDF files in a directory, you would need to go to that directory and save the following as a script called NameByLicence:
#!/bin/bash
# Don't barf if no files, or if upper or lower case names
shopt -s nullglob nocaseglob
for f in *.pdf; do
   lic=$(pdfgrep -i "License Number" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
   # Check licence is longer than 15 characters, else do nothing
   if [ ${#lic} -gt 15 ]; then
      echo mv "$f" "${lic}.pdf"
   fi
done
Once you have saved the script, make it executable (just necessary once) with:
chmod +x NameByLicence
Then you can run it with:
./NameByLicence
PLEASE MAKE A BACKUP FIRST AND TEST ON A FEW DUMMY FILES
If it looks correct, remove the word echo and it will actually do the name changes - at the moment it just tells you what it would do, rather than doing anything.
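The same "print first, rename later" idea can be sketched in Python. This is a hedged illustration rather than part of the script above: rename_by_license is a hypothetical helper, and it assumes you have already collected a mapping from file name to license number. It also skips any rename whose target name is already taken, which matters if two PDFs carry the same license.

```python
import os
import tempfile

def rename_by_license(folder, licenses, dry_run=True):
    """Rename files in folder using a {filename: license} mapping.

    Returns the (old, new) pairs that were renamed, or that would be
    renamed when dry_run is True.
    """
    planned = []
    taken = set()
    for old_name, lic in sorted(licenses.items()):
        new_name = lic + ".pdf"
        src = os.path.join(folder, old_name)
        dst = os.path.join(folder, new_name)
        if new_name in taken or os.path.exists(dst):
            print("skip %s: %s is already taken" % (old_name, new_name))
            continue
        print("mv %s %s" % (src, dst))
        if not dry_run:
            os.rename(src, dst)
        taken.add(new_name)
        planned.append((old_name, new_name))
    return planned

# Demo on throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("scan1.pdf", "scan2.pdf"):
        open(os.path.join(tmp, name), "w").close()
    found = {"scan1.pdf": "9-91-053-01-4L-04292",
             "scan2.pdf": "9-91-053-01-4L-04292"}
    rename_by_license(tmp, found)                 # dry run: prints only
    rename_by_license(tmp, found, dry_run=False)  # renames scan1, skips scan2
    print(sorted(os.listdir(tmp)))
```

Leaving dry_run at its default plays the same role as keeping the word echo in the bash script.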
If you don't want to use homebrew and pdfgrep, you can do it with native OSX tools, but it is a bit harder. Basically, you make an Automator workflow to extract the text from your PDF into a temporary text document, then you convert that from UTF-16 into ASCII and grep in there. If that makes sense to you, here are the steps:
Make an Automator workflow that looks like this:
You get /tmp in the "Save Output to" field by using SHIFT+COMMAND+G and typing /tmp. Check the Replace Existing Files box so that it still works for your second PDF when the licence from the previous file is there.
Save that as an Application called pdf2text. Now you can run the following instead of pdfgrep:
./pdf2text.app/Contents/MacOS/"Application Stub" SomeFile.pdf
and it will extract the text to /tmp/licence.txt. But you are not done yet, because that is UTF-16, so, to search in the file, you need:
iconv -c -f UTF-16 -t ASCII /tmp/licence.txt | grep -oE "[0-9A-Z-]{17,}"
9-91-053-01-4L-04292
So, now you need to put that inside the for loop in the little bash script above.
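For comparison, the decode-then-search step can be sketched in Python as well. This is a minimal sketch assuming the extracted text is UTF-16, as described above; license_from_utf16 is a made-up name, and the pattern is the same "17 or more digits, capitals and dashes" one used with grep.

```python
import re

# Same idea as grep -oE "[0-9A-Z-]{17,}"
LICENSE_RE = re.compile(r"[0-9A-Z-]{17,}")

def license_from_utf16(raw_bytes):
    """Decode UTF-16 bytes, drop non-ASCII characters, return first match or None."""
    text = raw_bytes.decode("utf-16", errors="ignore")
    # Mirror iconv -c -t ASCII: silently discard anything that is not ASCII
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    match = LICENSE_RE.search(ascii_text)
    return match.group(0) if match else None

sample = "License\nNumber\n\n9-91-053-01-4L-04292\n".encode("utf-16")
print(license_from_utf16(sample))  # prints 9-91-053-01-4L-04292
```

With a real file you would read the bytes using open("/tmp/licence.txt", "rb").read() and pass them to the function.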
Upvotes: 1