Sammitch

Reputation: 32272

Is my regex too greedy?

Background: We're using a tape library and the backup software NetWorker to back up data here. The client that's installed is fairly basic, and when we need to restore more than one target directory we create a script that launches one client instance per directory in the background, using one line like the following for each:

recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &

The trouble is that different partitions/directories backed up from the same machine at the same time might be spread across several different tapes, and some of those tapes may have been removed from the library between the backup and restore.

Up until recently the only ways people here have found out which tapes are needed were to either wait for the library to complain that it doesn't have a particular tape, or to set up a fake restore in a crappy old desktop GUI client and hit a particular menu option. The first option is super bad when the tape turns out to be off-site and takes a day to get back, and the second is tedious and time-consuming.

Actual Question: I've written a "meta"-script that reads the script that we've already created with the commands above, feeds it into the interactive CLI client, and gets it to spit out what tapes are required, and if they're actually in the library. To do this, the script uses the following regular expressions to pull out necessary info:

# pull out a list of the -a targets
restore_targets="`sed 's/^.* -a \([^ ]*\) .*$/\1/' $rec_script`"

# pull out a list of -c clients
restore_clients="`sed 's/^.* -c \([^ ]*\) .*$/\1/' $rec_script`"
# quote the variable so the newlines survive, and sort so duplicates are adjacent
numclients=`echo "$restore_clients" | sort -u | wc -l`

# pull out a list of -t dates
restore_dates="`sed 's/^.* -t \"\([^\"]*\)\" .*$/\1/' $rec_script`"
# quote the variable so the newlines survive, and sort so duplicates are adjacent
numdates=`echo "$restore_dates" | sort -u | wc -l`

I am not terribly familiar with using s/\(x\)/\1/ types of regexes, to the point that I don't remember the name, but is this the best way of accomplishing what I am doing? The commands work, but I'm wondering if I'm using the .* needlessly.

Upvotes: 2

Views: 168

Answers (1)

Blender

Reputation: 298502

\1 refers to the first capturing group. If you replace foo(.*) with \1 and feed in foobar, the resulting text becomes bar, because \1 stands for the text captured by the first capturing group (here, bar).
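To see the backreference at work, here is a minimal Python sketch using re.sub. The first call uses a greedy group so that group 1 actually holds bar; the second applies the same "match the whole line, keep only the group" shape as the sed commands in the question to the sample recover line.

```python
import re

# \1 in the replacement stands for whatever group 1 matched.
result = re.sub(r"foo(.*)", r"\1", "foobar")
print(result)  # bar

# Same idea as the sed command for -a: the pattern swallows the whole line
# and the replacement keeps only the captured target.
line = ('recover -c client-srv -t "Mon Dec 10 08:00:00" '
        '-s barckup-srv -d /dest/dir/ -f -a /src/dir &')
target = re.sub(r"^.* -a ([^ ]*) .*$", r"\1", line)
print(target)  # /src/dir
```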

As for your question, it might be safer and easier to parse the arguments using Python (or another high-level scripting language):

>>> import shlex
>>> shlex.split('recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &')
['recover', '-c', 'client-srv', '-t', 'Mon Dec 10 08:00:00', '-s', 'barckup-srv', '-d', '/dest/dir/', '-f', '-a', '/src/dir', '&']

Now, this is much easier to work with. The quotes are gone and all of the components of the command are nicely split up into a list.
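Building on that list, a rough sketch of how the three values from the question could be collected might look like this. The helper name option_values and the two-line sample script are made up for illustration, and the code simply assumes that each of -c, -t and -a is followed by its value, as in the sample line.

```python
import shlex

def option_values(line, flags=("-c", "-t", "-a")):
    """Map each wanted flag to the token that follows it on one command line."""
    tokens = shlex.split(line)
    return {flag: value
            for flag, value in zip(tokens, tokens[1:])
            if flag in flags}

# Hypothetical two-line recovery script, modelled on the line in the question.
script = '''\
recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/dir/ -f -a /src/dir &
recover -c client-srv -t "Mon Dec 10 08:00:00" -s barckup-srv -d /dest/other/ -f -a /src/other &
'''

rows = [option_values(line) for line in script.splitlines() if line.strip()]
targets = [row["-a"] for row in rows]
numclients = len({row["-c"] for row in rows})  # distinct clients
numdates = len({row["-t"] for row in rows})    # distinct dates
print(targets, numclients, numdates)
```

Using sets for the counts avoids the sort and uniq steps entirely.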

If you want this to be completely foolproof, you could use argparse and implement your own parser for this command line pretty easily. This will enable you to easily get the info, but it might be overkill for your situation.
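A possible argparse version is sketched below. The option list is inferred only from the sample line in the question, not from the NetWorker documentation: it assumes -c, -t, -s, -d and -a each take one value (treating -a as carrying the path, as the question does) and that -f is a bare switch. The attribute names (client, date, and so on) and the function parse_recover are invented for this example.

```python
import argparse
import shlex

def build_parser():
    # Option meanings are guessed from the sample line, not from the recover man page.
    parser = argparse.ArgumentParser(prog="recover", add_help=False)
    parser.add_argument("-c", dest="client")
    parser.add_argument("-t", dest="date")
    parser.add_argument("-s", dest="server")
    parser.add_argument("-d", dest="dest")
    parser.add_argument("-f", dest="f", action="store_true")
    parser.add_argument("-a", dest="target")
    return parser

def parse_recover(line):
    """Parse one line of the recovery script into an argparse namespace."""
    tokens = shlex.split(line)
    # Drop the program name and the trailing shell '&'.
    args = [tok for tok in tokens[1:] if tok != "&"]
    return build_parser().parse_args(args)

line = ('recover -c client-srv -t "Mon Dec 10 08:00:00" '
        '-s barckup-srv -d /dest/dir/ -f -a /src/dir &')
parsed = parse_recover(line)
print(parsed.client, parsed.date, parsed.target)
```

Unlike the regex approach, this exits with an error on a line containing an option the parser does not know, which may or may not be what you want for a hand-written script.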

As for your actual question, you can dissect the regex:

^.* -t "([^\"]*)" .*$

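Once the shell has processed the double-quoted sed script, \" is a plain ", so [^\"]* matches everything up to the next double quote and cannot run past it. Being greedy does no harm there, and a non-greedy version would stop at the same place. The class is not escape-aware, though: given -t "foo \" bar" it captures foo \ and stops at the escaped quote. The part that really is greedy is the leading ^.*, which only matters if -t " appears more than once on a line, in which case the last occurrence is used.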
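One way to check how the variants behave is to run them side by side. The sketch below uses Python's re module rather than sed, and both sample lines are invented for the comparison: one has a second quoted argument after the date, the other has an escaped quote inside the value.

```python
import re

def capture(pattern, line):
    """Return group 1 of the first match, or None if the line does not match."""
    match = re.search(pattern, line)
    return match.group(1) if match else None

negated = r'^.* -t "([^"]*)" .*$'  # the approach used in the question
greedy = r'^.* -t "(.*)" .*$'      # greedy dot inside the quotes
lazy = r'^.* -t "(.*?)" .*$'       # non-greedy dot inside the quotes

# Hypothetical line with a second quoted argument after the date.
line = 'recover -t "Mon Dec 10 08:00:00" -s "backup srv" -a /src/dir &'
print(capture(negated, line))  # Mon Dec 10 08:00:00
print(capture(greedy, line))   # Mon Dec 10 08:00:00" -s "backup srv
print(capture(lazy, line))     # Mon Dec 10 08:00:00

# Hypothetical value containing an escaped quote: the negated class
# stops at that quote because it knows nothing about backslashes.
escaped = r'recover -t "foo \" bar" -s srv'
print(capture(negated, escaped))  # prints: foo \ (backslash kept, rest dropped)
```

Only the greedy dot inside the quotes overshoots; the negated class and the non-greedy dot agree on these inputs.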

Upvotes: 1
