Running shell scripts from Mule ESB

Question

I have a flow set up to recognize when a file is dropped into a directory. Next I need to run a Bash script that processes the file (fairly intensive processing). The script grabs a PDF, creates a temporary directory, breaks the PDF into separate PNG files, runs an OCR processor against each image, converts the result to single-page PDFs, then merges all of the PDFs into a single multi-page PDF with the text layer from the OCR.

The problem is that the Bash script chokes after 10 concurrent transformations are triggered. Right now Mule ESB listens for new files, then triggers the script for each file, passing the appropriate parameters. Unfortunately, Mule only does two things here: listen, then trigger, with nothing in between to throttle the work. We are going to have over 200 files in that directory that need to be queued for processing, preferably 5 at a time. How do I get Mule to limit the number of concurrent processes it triggers?

Below is my initial draft Flow:

Here is the actual Bash script (gives some hints on what tools we are using):

#!/bin/bash

#Setting variables
PARAM=$#
TMPDIR=./split
INFILENAME=${1##*/}
OUTFILENAME=${2##*/}
echo "1 is $1"
echo "2 is $2"
echo "infilename is $INFILENAME"
echo "outfilename is $OUTFILENAME"

#Logging I/O filenames
echo "infile: $1" >> error.log
echo "outfile: $2" >> error.log

#If the temporary directory doesn't exist, make it
#(mkdir -p is a no-op when the directory is already there, so concurrent runs don't race)
mkdir -p "$TMPDIR"

#Check to see that the correct number of params have been passed.
if [[ $PARAM -lt 2 ]];
then
    echo "Usage: $0 source.pdf output.pdf"
    echo "output.pdf is the desired output file"
    echo "source.pdf is a file to be OCR'd"
    exit 1
fi

#Make sure the input file is a PDF
if [ "${1##*.}" == "pdf" ];
then
    multilayer=false

    #Check to see if the input file is a multi-layered pdf with searchable text
    #(-q only sets the exit status instead of printing the filename)
    if grep -Fq "Font" "$1"; then multilayer=true; fi

    #If it's not multi-layered, then perform the OCR
    if [ "$multilayer" == "false" ];
    then
        mkdir "$TMPDIR/$INFILENAME"
        echo "making temporary directory $TMPDIR/$INFILENAME"
        #Split the PDF into one-page PDFs in a temporary directory
        pdftk "$1" burst output "$TMPDIR/$INFILENAME/pg_%04d.pdf"
        echo "burst output to $TMPDIR/$INFILENAME/pg_%04d.pdf"
        mv "$1" processed/
        for files in "$TMPDIR/$INFILENAME/"*
        do
            echo "$files"
            filename=$(basename "$files")
            filename="${filename%.*}"

            #Convert the pdf page into an image
            gs -r300 -o "$TMPDIR/$INFILENAME/$filename.jpeg" -sDEVICE=jpeg "$TMPDIR/$INFILENAME/$filename.pdf"

            #Perform the OCR against the image
            tesseract "$TMPDIR/$INFILENAME/$filename.jpeg" "$TMPDIR/$INFILENAME/$filename" hocr

            #Combine the OCR'd image and OCR'd text into a multi-layer PDF file of that page
            hocr2pdf -i "$TMPDIR/$INFILENAME/$filename.jpeg" -o "$TMPDIR/$INFILENAME/$filename.pdf" < "$TMPDIR/$INFILENAME/$filename.html"
            compressed="$filename-compressed.pdf"

            #Compress the multi-layered PDF of the page (output file and input file are separate arguments)
            gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"

            #Replace the page PDF with its compressed version so the merge below picks it up
            mv "$TMPDIR/$INFILENAME/$compressed" "$TMPDIR/$INFILENAME/$filename.pdf"
        done

        #Concatenate all of the multi-layer PDF pages into a single PDF file
        pdftk "$TMPDIR/$INFILENAME/"*.pdf cat output "$OUTFILENAME"
        compressed="$OUTFILENAME-compressed.pdf"

        #Compress the multi-layered PDF
        gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$compressed" "$OUTFILENAME"
        mv "$compressed" "$2"
        rm -rf "$TMPDIR/$INFILENAME"
    else
        echo "The input file is multi-layered"
        mv "$1" "$2"
    fi
else
    echo "Please enter a valid input pdf file"
    exit 2
fi
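
For comparison, the "5 at a time" limit can also be sketched outside Mule in plain shell. This is only an illustration under assumptions, not the Mule-side fix: `run_limited` is a hypothetical helper name, and it relies on `xargs -P` and `find -maxdepth`, which are GNU/BSD extensions rather than POSIX.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script):
# run a worker command once per PDF in a directory, with at most
# max_jobs workers running at the same time.
#   $1      directory to scan for *.pdf files (non-recursive)
#   $2      maximum number of concurrent workers
#   $3...   worker command; each PDF path is appended as its last argument
run_limited() {
    local indir=$1
    local max_jobs=$2
    shift 2

    # -print0 / -0 keep filenames with spaces or newlines intact.
    # -n 1 hands one file to each worker invocation.
    # -P caps how many invocations xargs keeps running in parallel.
    find "$indir" -maxdepth 1 -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "$max_jobs" "$@"
}
```

Assuming the script above is saved as `./ocr.sh` and files land in `./incoming` (both names are placeholders), a call such as `run_limited ./incoming 5 sh -c './ocr.sh "$1" "finished/$(basename "$1")"' _` would work through the backlog five files at a time. The `sh -c` wrapper is there because the script expects two arguments (source and output) while `xargs -n 1` supplies only the source path.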

Thaneofife · Accepted Answer

@genjosanzo...you put me on the right track thinking about the processing strategy. Here is the solution that ended up working:
