ImageMagick convert tiffs to pdf with sequential file suffix

Question

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.

I have a file share that contains thousands of tiffs from a document management system. The goal is to convert and combine the multi-file tiffs into single-file pdfs (preferably in PDF/A-1a format).

The directory format:

/Document Management Root     # This is root directory
 ./2009/                      # each subdirectory represents a year
 ./2010/
 ./2011/
 ....
 ./2016/
 ./2016/000009.001            
 ./2016/000010.001
              # files are stored flat - just thousands of files per year directory

The document management system stores tiffs with sequentially numbered file names and sequential file suffixes, where the suffix gives the page number within a document. The suffix restarts when a new, unrelated document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.

I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.

The pseudo code I've come up with is:

for each file in {tiff directory}
    while file extension is "001"
      convert file to pdf and place new pdf file in {pdf directory}
    else 
      convert multiple files to pdf and place new pdf file in {pdf directory}

But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.

Any help is greatly appreciated.

EDIT - Both answers below work. The second answer worked too; my mistake was not realizing that the data set I tested against differed from the scenario described above.

Mark Setchell · Accepted Answer

So, save the following script in your login ($HOME) directory as TIFF2PDF

#!/bin/bash
ls *[0-9] | awk -F'.' '
   /001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
          { cmd=cmd " " $0}
   END    { print cmd,outfile}'

and make it executable (necessary just once) by going in Terminal and running:

chmod +x TIFF2PDF

Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:

~/TIFF2PDF

Sample Output

convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf

If that looks correct, you can actually execute those commands like this:

~/TIFF2PDF | bash

or, preferably if you have GNU Parallel installed:

~/TIFF2PDF | parallel

The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in 001, print the accumulated command followed by its output filename (this is skipped on the very first line, when nothing has accumulated yet). Then save the filename prefix with a .pdf extension as the output filename of the next PDF and start building up the next ImageMagick convert command. On subsequent lines (which don't end in 001), add the filename to the list of filenames to include in the PDF. At the end, output the last accumulated command followed by its output filename."
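To check the grouping logic without touching real scans, something like the following could be run. It is only a sketch: it assumes a throwaway directory from mktemp and empty placeholder files named after the example in the question, and the variable names demo_dir and commands are made up for illustration.

```shell
#!/bin/bash
# Sketch only: demo_dir and commands are illustrative names.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Empty placeholders named like the files in the question.
touch 000009.001 000010.001 000011.002 000012.003 000013.001

# How awk splits one name on the dot:
# $0 is the whole name, $1 the number, $2 the page suffix.
echo 000011.002 | awk -F'.' '{ print "whole=" $0, "prefix=" $1, "suffix=" $2 }'

# Same awk program as TIFF2PDF, captured in a variable instead of executed.
commands=$(ls *[0-9] | awk -F'.' '
   /001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
          { cmd=cmd " " $0}
   END    { print cmd,outfile}')

printf '%s\n' "$commands"
```

With those five placeholder names, the last line should print the same three convert commands as the sample output, and nothing is converted because the commands are only printed.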

As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area. I have circled them in red:

If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:

trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert original.tif -crop $trimbox result.tif

I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but with the extension .001, .002 and so on, you will need to tell ImageMagick to trim and force the output filetype to TIF:

original=XYZ.001
trimbox=$(convert $original -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert $original -crop $trimbox TIF:$original
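Applied to a whole directory, the same two commands could be wrapped in a loop along these lines. This is a rough sketch, not something from the original answer: the function name trim_all and its directory argument are hypothetical, and it assumes every page file has a three-digit suffix. It overwrites files in place, so point it only at a copy.

```shell
#!/bin/bash
# Sketch only: trim_all and its argument are illustrative names.
# Overwrites each file in place, so run it on a COPY of the images.
trim_all() {
    local dir=$1 f trimbox
    for f in "$dir"/*.[0-9][0-9][0-9]; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Measure the trim box on a blurred version so tiny specks are ignored.
        trimbox=$(convert "$f" -blur x2 -bordercolor black -border 1 \
                  -fuzz 50% -format %@ info:) || continue
        # Crop the unblurred original and force TIFF output despite the numeric suffix.
        convert "$f" -crop "$trimbox" "TIF:$f"
    done
}

# Example call (the path is a placeholder):
# trim_all /path/to/copy/2016
```

Files for which the trim box cannot be measured are skipped rather than cropped with an empty geometry.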

As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:

ls *[0-9] | awk ...

to

ls | grep '[0-9]$' | awk ...
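Another way around the limit, offered as a sketch rather than as part of the answer, is to let find match the names itself, so the shell never expands a long argument list. The function name list_numbered_files is made up for illustration; getconf ARG_MAX simply reports the limit on the current system.

```shell
#!/bin/bash
# Report the system limit on the combined size of arguments and environment.
getconf ARG_MAX

# Sketch only: list_numbered_files is an illustrative name.
# find receives just a directory and a pattern, so the number of
# matching files does not affect the length of any command line.
list_numbered_files() {
    find . -maxdepth 1 -type f -name '*[0-9]' | sed 's|^\./||' | sort
}
```

Its output could then be piped to the same awk program in place of ls *[0-9].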
