Planterguy

Reputation: 37

Batch Renaming PDFs using content from document on Mac

I have 7,000 PDF documents in a folder named "ffl". They have all been through an OCR program, so the contents can be copied and pasted.

Each document contains the text "license -----*****". The number is 15 characters long plus dashes, and the 10th character is a letter.

I need a batch process to rename every file to the license number found in the document.

Is there a script I can run to accomplish this? I have been searching for about a week. Everything I find talks about the new way to rename from Finder, and nothing covers renaming from the contents of a document. I am pretty new to Terminal.

I have seen the basic command for renaming, mv "old location" "new location":

mv /home/user/my_static /home/user/static

Right now I copy the number and paste it as the file name. I need a faster way.

Please and thank you for any advice.

Upvotes: 1

Views: 4673

Answers (3)

Jean-Frederic PLANTE

Reputation: 844

I had a similar problem where I wanted to rename a bunch of PDF files with content extracted from each file (in that case a date). I first tried a bash-only approach with pdfgrep, but the brew install failed for me (it seems the formula had not been updated).

What worked for me was using Automator to extract the PDF content to text, followed by a quick and dirty script that pulls out the text and renames the files. See the attached screenshot of the Automator action:

  • the first part cleans up the temp directory (in my case, copying the PDFs into "renaming_pdfs")
  • the second part extracts the text into RTF
  • the script grabs the text that I want to rename the file to (in this case the content of the line following "US4") and renames the files

[Screenshot: Automator action]
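The last step above might look roughly like this in Python. This is only a sketch, not the script from the screenshot: the function name text_after_marker is hypothetical, and it assumes the extracted text is plain text in which the wanted value sits on the first non-empty line after a line containing the marker ("US4" here).

```python
def text_after_marker(text, marker="US4"):
    """Return the first non-empty line after a line containing marker, or None."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            # Look only at the lines that follow the first marker line
            for following in lines[i + 1:]:
                if following.strip():
                    return following.strip()
            return None
    return None
```

The returned string can then be used as the new file name once it has been checked for characters that are not allowed in file names.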

Upvotes: 0

mertyildiran

Reputation: 6613

First please install pip:

sudo easy_install pip

or

brew install python

Secondly install pdfminer:

pip install pdfminer

Using pdfminer and the Python standard library, I have created a script that is specific to your problem:

rename.py

import glob
import os
import re
import subprocess

# Regex specific to your license number format, e.g. 9-91-053-01-4L-04292
LICENSE_RE = re.compile(r'[0-9]-[0-9]{2}-[0-9]{3}-[0-9]{2}-[0-9][A-Z]-[0-9]{5}')

for file in glob.glob("*.pdf"): # For all files with extension .pdf in the current directory

    # Get text content of the pdf file (a list argument keeps file names with spaces intact)
    pdf_text = subprocess.run(['pdf2txt.py', file], capture_output=True, text=True).stdout

    result = LICENSE_RE.search(pdf_text) # Find the license number

    if result: # If license number has been found
        new_name = result.group(0) + '.pdf'
        if os.path.exists(new_name): # Never overwrite an existing file
            print('Skipped ' + file + ' :: ' + new_name + ' already exists.')
            continue
        os.rename(file, new_name) # Rename file to LICENSE_NUMBER.pdf
        print('Renamed ' + file + ' to ' + new_name)

The script needs Python 3.7 or later. You can execute it by simply typing python3 rename.py.

This Python script searches the directory it is run from (put it in the same directory as the PDFs) for files with the .pdf extension.

Then it searches each file for the license number using a regular expression that I wrote for you.

Lastly, if there is a result, it changes the file's name to LICENSE_NUMBER.pdf

Addition upon OP's comment:

If some other PDF documents have slightly different formatting and this script does not work for them, investigate the extracted text by adding this line inside the loop, right after pdf_text is set:

print(repr(pdf_text))

For your sample file it was:

...ct ATI- \nCorrespondence To\n\nLicense\nNumber\n\n9-91-053-01-4L-04292\n\nA IF  - Chief. FF...

So I have created a regex to find substring \n\nLicense\nNumber\n\n9-91-053-01-4L-04292\n\nA and get License Number from it. Maybe you can create a more tolerant/general regex for your PDF documents by investigating more samples.
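As a starting point for a more tolerant regex, here is a minimal sketch. It assumes the digit grouping 1-2-3-2-2-5 from the sample above stays the same, and that the OCR may insert whitespace around the dashes or produce a lower-case letter. The names TOLERANT_LICENSE_RE and find_license are made up for this example.

```python
import re

# Assumption: same grouping as the sample 9-91-053-01-4L-04292, but with
# optional whitespace around each dash and an upper- or lower-case letter.
TOLERANT_LICENSE_RE = re.compile(
    r"(\d)\s*-\s*(\d{2})\s*-\s*(\d{3})\s*-\s*(\d{2})\s*-\s*(\d[A-Za-z])\s*-\s*(\d{5})"
)

def find_license(pdf_text):
    """Return the license number in normalised form, or None if not found."""
    match = TOLERANT_LICENSE_RE.search(pdf_text)
    if match is None:
        return None
    # Rebuild the number without stray whitespace and with an upper-case letter
    return "-".join(group.upper() for group in match.groups())
```

Normalising the match before using it as a file name keeps the renamed files consistent even when the OCR output varies between documents.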

Upvotes: 1

Mark Setchell

Reputation: 207465

Updated Answer

Ok, I think we can do a bit better now I understand the format of the number better...

#!/bin/bash
# Don't barf if no files, or if upper or lower case names
shopt -s nullglob nocaseglob

for f in *.pdf; do
    lic=$(pdfgrep "[0-9]-[0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9]-[0-9][A-Z]-[0-9][0-9][0-9][0-9][0-9]" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
    # Check licence is longer than 15 characters, else do nothing
    if [ ${#lic} -gt 15 ]; then
       echo mv "$f" "${lic}.pdf"
    fi
done

If it takes forever, you could also use homebrew to install GNU Parallel so you can do them all in parallel and get the job done faster. So, you would install with:

brew install parallel

and then change the script to handle just a single file, save it as Renamer and make it executable with chmod +x Renamer:

#!/bin/bash
if [ $# -ne 1 ]; then
   echo Usage: Renamer file
   exit 1
fi
f="$1"
lic=$(pdfgrep "[0-9]-[0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9]-[0-9][A-Z]-[0-9][0-9][0-9][0-9][0-9]" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
# Check licence is longer than 15 characters, else do nothing
if [ ${#lic} -gt 15 ]; then
   echo mv "$f" "${lic}.pdf"
fi

Then you can get them all done with:

parallel ./Renamer ::: *.pdf
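If you would rather not install GNU Parallel, a similar effect can be sketched in Python with a thread pool from the standard library. This is an illustration only: it assumes pdfgrep is installed and on the PATH, and the function names license_from_pdf and extract_all are made up here. Threads are enough because the real work happens in the separate pdfgrep processes.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Same pattern as in the bash script above
PATTERN = "[0-9]-[0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9]-[0-9][A-Z]-[0-9][0-9][0-9][0-9][0-9]"
LICENSE_RE = re.compile(PATTERN)

def license_from_pdf(path):
    """Run pdfgrep on one file and return the first license found, or None."""
    # Assumption: pdfgrep is installed and can be found on the PATH
    out = subprocess.run(["pdfgrep", PATTERN, path],
                         capture_output=True, text=True).stdout
    match = LICENSE_RE.search(out)
    return match.group(0) if match else None

def extract_all(paths, extract=license_from_pdf, workers=4):
    """Return a dict mapping each path to its license, extracted concurrently."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(extract, paths)))
```

Calling extract_all(glob.glob("*.pdf")) after importing glob would then give you a file-to-licence mapping without renaming anything yet.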

Option 1

You can extract the license number using pdfgrep, which you can install using homebrew. You would need to go to the homebrew website, copy the one-liner from there (which I don't want to put here in case it gets outdated), paste it into Terminal and run it. Then you can install pdfgrep with:

brew install pdfgrep

Alternatively, you can download and build pdfgrep yourself if you like doing that sort of thing!

You can then extract the licence from a PDF file with:

pdfgrep -i "License Number" SomeFile.pdf | grep -oE "[0-9-]+[A-Z][0-9-]+"

and put that in a variable with:

lic=$(pdfgrep -i "License Number" SomeFile.pdf | grep -oE "[0-9-]+[A-Z][0-9-]+")

So, if you have 7,000 PDF files in a directory, you would need to go to that directory and save the following as a script called NameByLicence:

#!/bin/bash
# Don't barf if no files, or if upper or lower case names
shopt -s nullglob nocaseglob

for f in *.pdf; do
    lic=$(pdfgrep -i "License Number" "$f" | grep -oE "[0-9-]+[A-Z][0-9-]+")
    # Check licence is longer than 15 characters, else do nothing
    if [ ${#lic} -gt 15 ]; then
       echo mv "$f" "${lic}.pdf"
    fi
done

Once you have saved the script, make it executable (just necessary once) with:

chmod +x NameByLicence

Then you can run it with:

./NameByLicence

PLEASE MAKE A BACKUP FIRST AND TEST ON A FEW DUMMY FILES

If it looks correct, remove the word echo and it will actually do the name changes - at the moment it just tells you what it would do, rather than doing anything.
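One thing the dry run will not show clearly is two PDFs that contain the same licence number, where the second mv would overwrite the first file. A rough Python sketch of a check for that follows. The names plan_renames and apply_renames are hypothetical, and it assumes you already have a mapping from each file name to the licence number extracted from it (None where nothing was found).

```python
import os

def plan_renames(licenses, existing=()):
    """Split files into safe renames and files to skip.

    licenses: dict mapping current file name -> licence number or None.
    existing: file names already present that must not be overwritten.
    """
    taken = set(existing)
    renames, skipped = [], []
    for old, lic in sorted(licenses.items()):
        new = f"{lic}.pdf" if lic else None
        # Skip files with no licence, and targets that are already claimed
        if new is None or new == old or new in taken:
            skipped.append(old)
            continue
        taken.add(new)
        renames.append((old, new))
    return renames, skipped

def apply_renames(renames, dry_run=True):
    """Print the planned renames, or perform them when dry_run is False."""
    for old, new in renames:
        if dry_run:
            print(f"mv {old} {new}")
        else:
            os.rename(old, new)
```

Anything that ends up in the skipped list needs a manual look, which is usually a small fraction of the 7,000 files.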

Option 2

If you don't want to use homebrew and pdfgrep, you can do it with native OSX tools, but it is a bit harder. Basically, you make an Automator workflow to extract the text from your PDF into a temporary text document and then you convert that from UTF-16 into ASCII and grep in there. If that makes sense to you, here are the steps:

Make an Automator workflow that looks like this:

[Screenshot: Automator workflow]

You get /tmp in the "Save Output to" field by using SHIFT+COMMAND+G and typing /tmp. Check the Replace Existing Files box so that it still works for your second PDF when the licence from the previous file is there.

Save that as an Application called pdf2text. Now you can run the following instead of pdfgrep:

./pdf2text.app/Contents/MacOS/"Application Stub" SomeFile.pdf

and it will extract the text to /tmp/licence.txt. But you are not done yet, because that is UTF-16, so, to search in the file, you need:

iconv -c -f UTF-16 -t ASCII /tmp/licence.txt | grep -oE "[0-9A-Z-]{17,}" 
9-91-053-01-4L-04292

So, now you need to put that inside the for loop in the little bash script above.
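If you prefer to do the decoding and searching step in Python rather than with iconv and grep, a minimal sketch could look like this. It assumes the workflow writes UTF-16 text to /tmp/licence.txt as described above, and it reuses the same pattern as the grep command. The function names are made up for this example.

```python
import re

# Same pattern as the grep above: 17 or more digits, capitals or dashes
LICENSE_LIKE_RE = re.compile(r"[0-9A-Z-]{17,}")

def license_from_utf16(data):
    """Decode UTF-16 bytes and return the first licence-like string, or None."""
    text = data.decode("utf-16", errors="ignore")
    match = LICENSE_LIKE_RE.search(text)
    return match.group(0) if match else None

def license_from_file(path="/tmp/licence.txt"):
    """Read the text file written by the Automator workflow and search it."""
    with open(path, "rb") as handle:
        return license_from_utf16(handle.read())
```

With the sample document this should find the same number that the iconv and grep pipeline prints.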

Upvotes: 1
