Orsiris de Jong

Reputation: 3016

How to refactor a find | xargs one-liner into human-readable code

I've written an OCR wrapper batch & service script for tesseract and abbyyocr11 found here: https://github.com/deajan/pmOCR

The main function is a find command that passes its results to xargs, using -print0 (paired with xargs -0) in order to deal with special filenames. The find command became more and more complex and ended up as a VERY long one-liner that is difficult to maintain:

find "$DIRECTORY_TO_PROCESS" -type f -iregex ".*\.$FILES_TO_PROCES" ! -name "$find_excludes" -print0 | xargs -0 -I {} bash -c 'export file="{}"; function proceed { eval "\"'"$OCR_ENGINE_EXEC"'\" '"$OCR_ENGINE_INPUT_ARG"' \"$file\" '"$OCR_ENGINE_ARGS"' '"$OCR_ENGINE_OUTPUT_ARG"' \"${file%.*}'"$FILENAME_ADDITION""$FILENAME_SUFFIX$FILE_EXTENSION"'\" && if [ '"$_BATCH_RUN"' -eq 1 ] && [ '"$_SILENT"' -ne 1 ];then echo \"Processed $file\"; fi && echo -e \"$(date) - Processed $file\" >> '"$LOG_FILE"' && if [ '"$DELETE_ORIGINAL"' == \"yes\" ]; then rm -f \"$file\"; fi"; }; if [ "'$CHECK_PDF'" == "yes" ]; then if ! pdffonts "$file" 2>&1 | grep "yes" > /dev/null; then proceed; else echo "$(date) - Skipping file $file already containing text." >> '"$LOG_FILE"'; fi; else proceed; fi'

Is there a nicer way to pass the find results to a human-readable function (without impacting speed too much)?

Thanks.

Upvotes: 2

Views: 219

Answers (4)

Orsiris de Jong

Reputation: 3016

I ended up using a while loop that reads from a find command through process substitution, i.e.:

while IFS= read -r -d $'\0' file; do
    if ! lsof -f -- "$file" > /dev/null 2>&1; then
        if [ "$_BATCH_RUN" == true ]; then
            Logger "Preparing to process [$file]." "NOTICE"
        fi
        OCR "$file" "$fileExtension" "$ocrEngineArgs" "$csvHack"
    else
        if [ "$_BATCH_RUN" == true ]; then
            Logger "Cannot process file [$file] currently in use." "ALWAYS"
        else
            Logger "Deferring file [$file] currently being written to." "ALWAYS"
            kill -USR1 "$SCRIPT_PID"
        fi
    fi
done < <(find "$directoryToProcess" -type f -iregex ".*\.$FILES_TO_PROCES" ! -name "$findExcludes" -and ! -wholename "$moveSuccessExclude" -and ! -wholename "$moveFailureExclude" -and ! -name "$failedFindExcludes" -print0)

The while loop reads each file name produced by find into the file variable. Using -d $'\0' with read and -print0 with find makes both sides agree on NUL as the separator, which handles file names containing spaces, newlines or other special characters.
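
To illustrate the pattern in isolation, here is a minimal sketch that runs against a scratch directory. The directory, the file names and the process_file function are hypothetical stand-ins for the real OCR work, and -iname is used instead of the answer's -iregex only to keep the demo short. It needs bash, since it relies on process substitution.

```shell
#!/usr/bin/env bash
# Hypothetical demo data: a scratch directory with awkward file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.pdf" "$demo_dir/with space.pdf" "$demo_dir/"$'new\nline.pdf'

# process_file is a placeholder for the real per-file work (OCR in the answer).
process_file() {
    printf 'Would process [%s]\n' "$1"
}

processed=0
# IFS= keeps leading/trailing whitespace, -r keeps backslashes literal,
# and -d '' (equivalent to -d $'\0') makes read stop at each NUL from -print0.
while IFS= read -r -d '' file; do
    process_file "$file"
    processed=$((processed + 1))
done < <(find "$demo_dir" -type f -iname '*.pdf' -print0)

echo "Processed $processed files"
rm -rf "$demo_dir"
```

Because the loop is fed by process substitution rather than a pipe, it runs in the current shell, so variables set inside it (processed here) are still available afterwards.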

Upvotes: 0

chepner

Reputation: 531325

You can replace find altogether. It's easier in bash 4 (which I'll show here), but doable in bash 3.

proceed () {
  ...
}

shopt -s globstar

extensions=(pdf tif tiff jpg jpeg bmp pcx dcx)
for ext in "${extensions[@]}"; do
  for file in /some/path/**/*."$ext"; do
    [[ ! -f $file || $file = *_ocr.pdf ]] && continue
    # Rest of script here
  done
done
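
As a rough sketch of how the globstar loop behaves, the following runs it against a scratch directory. The directory layout and file names are made up for the demo, and the nullglob and nocaseglob options are additions that are not part of the answer above: nullglob makes the loop body run zero times when nothing matches, and nocaseglob approximates the case-insensitive matching of -iregex.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for globstar. nullglob/nocaseglob are optional extras.
shopt -s globstar nullglob nocaseglob

# Hypothetical demo tree.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sub/deeper"
touch "$demo_dir/a.pdf" "$demo_dir/sub/B.PDF" \
      "$demo_dir/sub/deeper/c_ocr.pdf" "$demo_dir/sub/notes.txt"

matches=()
for file in "$demo_dir"/**/*.pdf; do
    # Keep regular files only and skip files that were already OCRed.
    [[ -f $file && $file != *_ocr.pdf ]] || continue
    matches+=("${file#"$demo_dir"/}")
done

printf '%s\n' "${matches[@]}"

# Restore the default globbing behaviour and clean up.
shopt -u globstar nullglob nocaseglob
rm -rf "$demo_dir"
```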

Prior to bash 4, you can write your own recursive function to descend through a directory hierarchy.

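# descend DIR EXT: walk DIR recursively and handle regular files ending in .EXT
descend () {
    local dir=$1 ext=$2 fd
    for fd in "$dir"/*; do
        if [[ -d $fd ]]; then
            # Pass the extension along instead of relying on a global variable
            descend "$fd" "$ext"
        elif [[ -f $fd && $fd == *."$ext" && $fd != *_ocr.pdf ]]; then
            : # Rest of script here (":" keeps the branch syntactically valid)
        fi
    done
}

for ext in "${extensions[@]}"; do
    descend /some/path "$ext"
done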

Upvotes: 2

glenn jackman

Reputation: 246847

OK, create the script, then run find.

#!/bin/bash

trap cleanup EXIT
cleanup() { rm "$script"; }

script=$(mktemp)
cat <<'END' > "$script"
########################################################################
file="$1"

function proceed { 
    "$OCR_ENGINE_EXEC" "$OCR_ENGINE_INPUT_ARG" "$file" "$OCR_ENGINE_ARGS" "$OCR_ENGINE_OUTPUT_ARG" "${file%.*}$FILENAME_ADDITION$FILENAME_SUFFIX$FILE_EXTENSION"
    if [ "$_BATCH_RUN" -eq 1 ] && [ "$_SILENT" -ne 1 ]; then 
        echo "Processed $file"
    fi
    echo -e "$(date) - Processed $file" >> "$LOG_FILE"
    if [ "$DELETE_ORIGINAL" == "yes" ]; then 
        rm -f "$file"
    fi
}

if [ "$CHECK_PDF" == "yes" ]; then 
    if ! pdffonts "$file" 2>&1 | grep "yes" > /dev/null; then 
        proceed
    else 
        echo "$(date) - Skipping file $file already containing text." >> "$LOG_FILE"
    fi
else 
    proceed
fi
########################################################################
END

find "$DIRECTORY_TO_PROCESS" -type f \
                             -iregex ".*\.$FILES_TO_PROCES" \
                           ! -name "$find_excludes" \
                             -exec bash "$script" '{}' \;

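The 'END' delimiter of the heredoc is quoted, so the variables are not expanded when the script is written, only when it is actually executed. Because find runs the script in a separate bash process, the calling script must export the variables the script relies on (OCR_ENGINE_EXEC, LOG_FILE and so on).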
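
A small sketch of both points, using made-up variable names (DEMO_EXPORTED and DEMO_PLAIN are purely illustrative): the quoted delimiter keeps the dollar signs literal in the generated file, and only exported variables are visible when that file is later run by a child bash.

```shell
#!/usr/bin/env bash
script=$(mktemp)

# Quoted delimiter: the text below is written to the file unchanged,
# so the variables are looked up only when the script runs.
cat <<'END' > "$script"
printf '%s|%s\n' "${DEMO_EXPORTED:-unset}" "${DEMO_PLAIN:-unset}"
END

export DEMO_EXPORTED="visible"   # inherited by child processes
DEMO_PLAIN="invisible"           # stays local to this shell

result=$(bash "$script")
echo "$result"

rm -f "$script"
```

The child process sees the exported variable but falls back to "unset" for the plain one, which is why the calling script needs export (or set -a) for every setting the generated script reads.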

Upvotes: 2

chepner

Reputation: 531325

Don't use bash -c. You are already committed to starting a new bash process for each file from the find command, so just save the code to a file and run that with

find "$DIRECTORY_TO_PROCESS" -type f -iregex ".*\.$FILES_TO_PROCES" \
     ! -name "$find_excludes" -print0 |
     xargs -0 -I {} bash script.bash {}
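
As a possible variation on the command above (not part of the original answer), the script can loop over "$@" so that xargs may hand it many files per invocation. With -I {} xargs starts one bash per file; without it, xargs appends as many names as fit on each command line. The sketch below is self-contained and uses a hypothetical scratch directory and a stand-in script.bash that only prints the names it receives.

```shell
#!/usr/bin/env bash
# Hypothetical demo data.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.pdf" "$demo_dir/b c.pdf" "$demo_dir/d.tif"

# Stand-in for script.bash: handles every file argument it is given.
script="$demo_dir/script.bash"
cat <<'END' > "$script"
for file in "$@"; do
    printf 'Processed %s\n' "${file##*/}"
done
END

# No -I {}: xargs batches the NUL-separated names into few bash invocations.
output=$(find "$demo_dir" -type f \( -iname '*.pdf' -o -iname '*.tif' \) -print0 |
    xargs -0 bash "$script" | sort)
echo "$output"

rm -rf "$demo_dir"
```

Starting fewer bash processes addresses the speed concern in the question, at the cost of the script having to iterate over its arguments itself.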

Upvotes: 3
