johndashen

Reputation: 224

shell scripting: search/replace & check file exist

I have a perl script (or any executable) E which will take a file foo.xml and write a file foo.txt. I use a Beowulf cluster to run E for a large number of XML files, but I'd like to write a simple job server script in shell (bash) which doesn't overwrite existing txt files.

I'm currently doing something like

#!/bin/sh
PATTERN="[A-Z]*0[1-2][a-j]"; # this matches foo in all cases 
todo=`ls *.xml | grep $PATTERN -o`;
isdone=`ls *.txt | grep $PATTERN -o`;

whatsleft=todo - isdone; # what's the unix magic?

#tack on the .xml prefix with sed or something

#and then call the job server; 
jobserve E "$whatsleft";

and then I don't know how to get the difference between $todo and $isdone. I'd prefer using sort/uniq to something like a for loop with grep inside, but I'm not sure how to do it (pipes? temporary files?)

As a bonus question, is there a way to do lookahead search in bash grep?

To clarify/extend the problem:

I have a bunch of programs that take input from sources like (but not necessarily) data/{branch}/special/{pattern}.xml and write output to another directory results/special/{branch}-{pattern}.txt (or data/{branch}/intermediate/{pattern}.dat, e.g.). I want to check in my jobfarming shell script if that file already exists.

So E transforms data/{branch}/special/{pattern}.xml->results/special/{branch}-{pattern}.dat, for instance. I want to look at each instance of the input and check if the output exists. One (admittedly simpler) way to do this is just to touch *.done files next to each input file and check for those results, but I'd rather not manage those, and sometimes the jobs terminate improperly so I wouldn't want them marked done.

N.B. I don't need to check concurrency yet or lock any files.

So a simple, clear way to solve the above problem (in pseudocode) might be

for i in `/bin/ls *.xml`
do
   replace xml suffix with txt
   if [that file does not exist]
      add to whatsleft list
   end
done
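In plain bash that pseudocode might look like the following sketch (the demo filenames are invented to match the `[A-Z]*0[1-2][a-j]` pattern; the setup lines just build a scratch directory to run against):

```shell
# Demo setup with invented filenames: two inputs, one already processed.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch A01a.xml B02c.xml A01a.txt

whatsleft=()
for i in *.xml; do
  out="${i%.xml}.txt"          # replace the xml suffix with txt
  if [ ! -f "$out" ]; then     # keep only inputs with no output yet
    whatsleft+=("$i")
  fi
done
printf '%s\n' "${whatsleft[@]}"   # → B02c.xml
```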

but I'm looking for something more general.

Upvotes: 0

Views: 1875

Answers (5)

Charles Duffy

Reputation: 295678

#!/bin/bash       # extglob, arrays and <() are bash features, not POSIX sh

shopt -s extglob # allow extended glob syntax, for matching the filenames

LC_COLLATE=C     # use a sort order comm is happy with

IFS=$'\n'        # so filenames can have spaces but not newlines
                 # (newlines don't work so well with comm anyhow;
                 # shame it doesn't have an option for null-separated
                 # input lines).

files_todo=( *([A-Z])0[1-2][a-j]*.xml )
files_done=( *([A-Z])0[1-2][a-j]*.txt )
files_remaining=( \
  $(comm -23 --nocheck-order \
    <(printf "%s\n" "${files_todo[@]%.xml}") \
    <(printf "%s\n" "${files_done[@]%.txt}") ))

echo jobserve E $(for f in "${files_remaining[@]}"; do printf "%s\n" "${f}.xml"; done)

This assumes that you want a single jobserve E call with all the remaining files as arguments; it's rather unclear from the specification if such is the case.

Note the use of extended globs rather than parsing ls, which is considered very poor practice.

To transform input to output names without using anything other than shell builtins, consider the following:

if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]] ; then
  out_name=results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat
else
  : # ...handle here the fact that you have a noncompliant name...
fi
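A quick check of that transform, using an invented input name that follows the data/{branch}/special/{pattern}.xml layout from the question:

```shell
# Hypothetical input name; "main" and "A01a" are invented for illustration.
in_name="data/main/special/A01a.xml"
if [[ $in_name =~ data/([^/]+)/special/([^/]+)\.xml ]]; then
  out_name="results/special/${BASH_REMATCH[1]}-${BASH_REMATCH[2]}.dat"
fi
echo "$out_name"   # → results/special/main-A01a.dat
```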

Upvotes: 1

johndashen

Reputation: 224

For posterity's sake, this is what I found to work:

TMPA='neverwritethis.tmp'
TMPB='neverwritethat.tmp'
ls *.xml | grep -o "$PATTERN" > "$TMPA"
ls *.txt | grep -o "$PATTERN" > "$TMPB"
whatsleft=$(sort "$TMPA" "$TMPB" | uniq -u | sed 's/$/.xml/' | xargs)
rm "$TMPA" "$TMPB"

Upvotes: 0

slacker

Reputation: 2142

whatsleft=$( ls *.xml *.txt | grep $PATTERN -o | sort | uniq -u )

Note this actually computes a symmetric difference: a stale .txt with no matching .xml will also appear in the list, not just unprocessed .xml files.
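If you want only the one-sided difference (inputs lacking outputs, ignoring stale outputs), comm can report just the first-column-only lines instead of uniq -u. A sketch with invented demo filenames, where C02b.txt is a deliberately stale output:

```shell
# Demo setup: A01a is done, B02c is pending, C02b.txt is stale output.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch A01a.xml B02c.xml A01a.txt C02b.txt

# comm -23 keeps lines unique to the first (sorted) stream.
whatsleft=$(comm -23 \
  <(printf '%s\n' *.xml | sed 's/\.xml$//' | sort) \
  <(printf '%s\n' *.txt | sed 's/\.txt$//' | sort))
echo "$whatsleft"   # → B02c  (C02b does not leak in)
```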

Upvotes: 1

Jonathan Leffler

Reputation: 754620

The question title suggests that you might be looking for:

 set -o noclobber

The question content indicates a wholly different problem!

It seems you want to run 'jobserve E' on each '.xml' file without a matching '.txt' file. You'll need to assess the TOCTOU (Time of Check, Time of Use) problems here because you're in a cluster environment. But the basic idea could be:

 todo=""
 for file in *.xml
 do [ -f ${file%.xml}.txt ] || todo="$todo $file"
 done
 jobserve E $todo

This will work with Korn shell as well as Bash. In Bash you could explore making 'todo' into an array; that will deal with spaces in file names better than this will.
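The array variant mentioned above might look like this sketch (demo filenames are invented; one contains a space to show why the array form is safer than the string form):

```shell
# Demo setup: one processed file, one pending file with a space in its name.
tmpdir=$(mktemp -d)
cd "$tmpdir"
touch "A01a.xml" "B 02c.xml" "A01a.txt"

todo=()
for file in *.xml
do [ -f "${file%.xml}.txt" ] || todo+=("$file")
done
printf '%s\n' "${todo[@]}"    # → B 02c.xml
# jobserve E "${todo[@]}"     # quoted expansion keeps each name as one argument
```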

If you have processes still generating '.txt' files for '.xml' files while you run this check, you will get some duplicated effort (because this script cannot tell that the processing is happening). If the 'E' process creates the corresponding '.txt' file as it starts processing, that minimizes the chance of duplicated effort. Or, maybe consider separating the processed files from the unprocessed files, so the 'E' process moves the '.xml' file from the 'to-be-done' directory to the 'done' directory (and writes the '.txt' file to the 'done' directory too). If done carefully, this can avoid most of the multi-processing problems. For example, you could link the '.xml' into the 'done' directory when processing starts, and ensure appropriate cleanup with an 'atexit()' handler (if you are moderately confident your processing program does not crash). Or other trickery of your own devising.

Upvotes: 1

ghostdog74

Reputation: 342769

I am not exactly sure what you want, but you can check for the existence of the file first and, if it exists, create a new name. (Or you could do this check inside your E Perl script.)

if [ -f "$file" ];then
  newname="...."
fi
...
jobserve E .... > $newname 

If that's not what you want, describe more clearly in your question what you mean by "don't overwrite files".

Upvotes: 0
