Curious George

Reputation: 21

ImageMagick convert tiffs to pdf with sequential file suffix

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.

I have a file share that contains thousands of TIFFs from a document management system. The goal is to convert the single-page TIFFs and combine them into one PDF per document (preferably in PDF/A-1a format).

The directory format:

/Document Management Root     # This is root directory
 ./2009/                      # each subdirectory represents a year
 ./2010/
 ./2011/
 ....
 ./2016/
 ./2016/000009.001            
 ./2016/000010.001
              # files are stored flat - just thousands of files per year directory

The document management system stores tiffs with sequential number file names along with sequential file suffixes:

000009.001
000010.001
000011.002
000012.003
000013.001

Where each page of a document is represented by the suffix. The suffix restarts when a new, non-related document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.

I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.

The pseudo code I've come up with is:

for each file in {tiff directory}
    while file extension is "001"
      convert file to pdf and place new pdf file in {pdf directory}
    else 
      convert multiple files to pdf and place new pdf file in {pdf directory}

But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.

Any help is greatly appreciated.

EDIT - Both answers below work. The second answer did work; it was my mistake not to realize that the data set I tested against differed from the scenario above.

Upvotes: 1

Views: 1401

Answers (2)

Mark Setchell

Reputation: 207425

So, save the following script in your login ($HOME) directory as TIFF2PDF

#!/bin/bash
ls *[0-9] | awk -F'.' '
   /001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
          { cmd=cmd " " $0}
   END    { print cmd,outfile}'

and make it executable (necessary just once) by opening Terminal and running:

chmod +x TIFF2PDF    

Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:

~/TIFF2PDF

Sample Output

convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf

If that looks correct, you can actually execute those commands like this:

~/TIFF2PDF | bash

or, preferably if you have GNU Parallel installed:

~/TIFF2PDF | parallel

The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in 001, print the command accumulated for the previous document followed by its output filename (on the very first line there is nothing to print yet). Then save the filename prefix with a .pdf extension as the output filename of the next PDF and start building up the next ImageMagick convert command. On subsequent lines (which don't end in 001), add the filename to the list of filenames to include in the PDF. At the end, output the last accumulated command followed by its output filename."
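To see the grouping logic without touching any real files, the same awk program can be fed the sample names from the question. This is only a sketch: the function name `group_pages` is made up for illustration, and it assumes the names arrive sorted, one per line.

```shell
#!/bin/bash
# group_pages: hypothetical helper wrapping the awk program from TIFF2PDF.
# Reads sorted file names on stdin, prints one convert command per document.
group_pages() {
  awk -F'.' '
    /001$/ { if (NR > 1) print cmd, outfile   # flush the previous document
             outfile = $1 ".pdf"              # the PDF is named after page 1
             cmd = "convert " $0
             next }
           { cmd = cmd " " $0 }               # later pages join the current document
    END    { print cmd, outfile }'            # flush the last document
}

# Sample names from the question; no files are needed for this dry run
printf '%s\n' 000009.001 000010.001 000011.002 000012.003 000013.001 | group_pages
```

This should print the same three convert lines as the sample output above.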


As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area. I have circled them in red:

[Image: the scan, with the white specks circled in red]

If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:

trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert original.tif -crop $trimbox result.tif

I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but with the extension 001, 002 and so on, you will need to tell ImageMagick to trim and force the output filetype to TIF:

original=XYZ.001
trimbox=$(convert "$original" -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert "$original" -crop "$trimbox" "TIF:$original"
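To apply this to a whole directory of copies, a loop along the following lines could work. It is a sketch under assumptions: the function name `trim_all` and its directory argument are hypothetical, ImageMagick's convert is assumed to be on your PATH, and every file whose name ends in a digit is treated as a page. It overwrites each file, so only point it at the copy.

```shell
#!/bin/bash
# trim_all: hypothetical helper; run it only on A COPY of the images.
# Trims every file in the given directory whose name ends in a digit.
trim_all() {
  local dir=$1 f trimbox
  for f in "$dir"/*[0-9]; do
    [ -f "$f" ] || continue    # skip if the glob matched nothing
    # Work out the trim box on a blurred, bordered version of the page ...
    trimbox=$(convert "$f" -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
    # ... then crop the unblurred file, forcing TIFF output despite the numeric suffix
    convert "$f" -crop "$trimbox" "TIF:$f"
  done
}
```

You would call it with one year directory of the copy at a time, for example `trim_all ./copy/2016` (the path here is just a placeholder).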

As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:

ls *[0-9] | awk ...

to

ls | grep '[0-9]$' | awk ...
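Another way around the limit, offered as a suggestion rather than part of the script above: `printf` is a bash builtin, so the expanded glob is never handed to a new process and ARG_MAX does not come into play. The function name `list_pages` is made up for illustration. Glob results come back sorted, much like the output of ls, but if nothing matches, the literal pattern is printed instead.

```shell
#!/bin/bash
# list_pages: hypothetical helper that lists names ending in a digit, one per line.
# printf is a shell builtin, so the expanded glob is never passed to exec()
# and the ARG_MAX limit on command-line length does not apply.
list_pages() {
  printf '%s\n' *[0-9]
}

getconf ARG_MAX          # prints the limit, in bytes, on this system
# list_pages | awk ...   (in place of: ls *[0-9] | awk ...)
```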

Upvotes: 2

Alex P.

Reputation: 31666

The following command converts the whole /Document Management Root tree (assuming that is its actual absolute path). It processes all subfolders correctly, even those whose names contain whitespace, and skips any file that doesn't match the 000000.000 naming pattern:

find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{6\}\.001$' -exec bash -c 'p=$1; d="${p:0: -10}"; n=${p: -10:6}; m=10#$n; c[1]="$d$n.001"; for i in {2..999}; do k=$((m+i-1)); l=$(printf "%s%06d.%03d" "$d" $k $i); [[ -f "$l" ]] || break; c[$i]="$l"; done; echo -n "convert"; printf " %q" "${c[@]}" "$d$n.pdf"; echo' _ {} \; | bash

To do a dry run, just remove the | bash at the end.

Updated to match the 00000000.000 pattern (and split to multiple lines for clarity):

find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{8\}\.001$' -exec bash -c '
  pages[1]="$1"                        # first page: path ending in NNNNNNNN.001
  p1num="10#${pages[1]: -12:8}"        # 8-digit number, forced to base 10
  for i in {2..999}; do
    # page i has the next sequential number and suffix i
    nextpage=$(printf "%s%08d.%03d" "${pages[1]:0: -12}" $((p1num+i-1)) $i)
    [[ -f "$nextpage" ]] || break      # stop at the first missing page
    pages[i]="$nextpage"
  done
  echo -n "convert"
  printf " %q" "${pages[@]}" "${pages[1]:0: -3}pdf"
  echo
' _ {} \; | bash
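To make the string slicing easier to follow, here is how the pieces come out for one made-up path (the path and variable names are purely illustrative). It assumes bash 4.2 or later, which the negative lengths in the command above need anyway.

```shell
#!/bin/bash
# Illustrative path only; any path ending in NNNNNNNN.001 behaves the same way.
page1='/Document Management Root/2016/00000010.001'

dir=${page1:0: -12}       # everything before the 12-character file name
num=${page1: -12:8}       # the 8-digit sequence number, still zero-padded
p1num=$((10#$num))        # 10# forces base 10; a leading 0 would otherwise mean octal

# Page i of the same document is expected at number p1num+i-1 with suffix i
page2=$(printf '%s%08d.%03d' "$dir" $((p1num + 1)) 2)
pdf=${page1:0: -3}pdf     # swap the 001 suffix for pdf

printf '%s\n' "$dir" "$num" "$page2" "$pdf"
```

The loop in the command keeps building such candidate names for pages 2, 3, 4 and so on, and stops at the first one that doesn't exist on disk.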

Upvotes: 1
