POSIX regex: match only within comma-delimited items

Question

Context: I'm writing a shell script to help manage a simple database stored in a human-readable way in text files and edited using a normal text editor. (Each entry is a text file whose name is an ID number, and all files are stored in a single directory.)

My current problem is searching. Each file starts with some headers, which are basically data fields. For instance, take the tags field, which begins on a new line with Tags:<TAB> (where <TAB> is a literal tab character) followed by a list of comma-separated tags. I'd like to plug a user-supplied regex into a larger call to grep and have the user's regex match only within each comma-separated item.

Here's a bit from my documentation that describes what I would like to happen:

hregexes are EREs matched only within comma-separated items. For instance, with the header Tags: foo, bar baz:

REGEX     :: MATCHES?
foo       :: yes
bar       :: yes
baz       :: yes
az        :: yes
.*baz     :: yes
ba.*az    :: yes
o, ba     :: no
foo.*baz  :: no

This would ideally work purely with POSIX extended regexps, for consistency with the rest of the system; I had a simplified version of the search working in Python but decided I should rewrite that portion so the system wouldn't have some searches take POSIX regexps and some Python ones.

I did try to come up with a pattern, but I'm not really good enough with regexps to do something this complicated. In the following attempt, $2 is the header we're looking for, and $3 is the pattern to match in that header.

grep -El "$2:   (|.*,|.*, )[^,]*$3[^,]*(,|\b)" *.dre

This doesn't miss any results it should catch, but it has the problem that o, ba and foo.*baz both match when they shouldn't; at this point I might as well just search for $2: .*$3.
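A minimal reproduction of the over-matching, assuming the sample header from the table above. The helper name try_match is made up for this sketch, and the pattern is a portable rendering of the attempt: `(.*, ?)?` stands in for the empty alternation and `$` for `\b`, which is a GNU extension rather than POSIX ERE. The `[^,]*` guards only constrain the text around the user's regex; nothing stops the regex itself from consuming a comma.

```shell
#!/bin/sh
tab=$(printf '\t')
line="Tags:${tab}foo, bar baz"    # sample header line (assumed content)

# try_match REGEX: print "yes" if the attempted pattern matches the sample line.
# (try_match is a hypothetical helper for this demonstration only.)
try_match() {
    if printf '%s\n' "$line" |
        grep -Eq "Tags:$tab(.*, ?)?[^,]*$1[^,]*(,|\$)"; then
        echo yes
    else
        echo no
    fi
}

try_match 'az'          # should match, and does
try_match 'o, ba'       # should not match, but does: the regex contains the comma
try_match 'foo.*baz'    # should not match, but does: .* runs across the comma
```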

If this isn't possible with a single ERE, is there another good way to do this in Bash? My database already has over a thousand files and could easily grow to many times that, so I'd prefer not to be looping over each file and then over each item in the comma-separated list and incurring shell overhead every time.

Soren Bjornstad · Accepted Answer

The following solution, based on Perry's idea of changing the separator, is not foolproof, but preserves the desirable running time while making it pretty hard to screw up.

First, we choose a delimiter to replace the commas with; I have chosen @@@@@, reasoning that this will not occur in any properly formed tag. (The tags are normally purely alphanumeric.)

We then modify the user's regexp to replace every . with [^@] so that no expression will cross the @@@@@ boundaries unless it is explicitly written to. Other constructs can still match @, say [[:punct:]]; I'm not terribly worried about those, but if someone has thoughts about other special characters that might be problematic, I'd like to hear about them.
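Here is a small sketch of that rewrite applied to the example from the question, assuming the tag list has already had its ", " separators swapped for @@@@@. The function names rewrite_pattern and item_match are invented for illustration. Note the g flag on the sed substitution, so that every dot in the regex is replaced and not only the first.

```shell
#!/bin/sh
# rewrite_pattern REGEX: replace every "." so it refuses to match "@".
# (Hypothetical helper; dots inside bracket expressions or after a
# backslash are rewritten too, which is one reason this is not foolproof.)
rewrite_pattern() {
    printf '%s\n' "$1" | sed -e 's/\./[^@]/g'
}

# item_match REGEX: print "yes" if the rewritten REGEX matches the sample tags.
item_match() {
    tags='foo@@@@@bar baz'    # "foo, bar baz" after the delimiter swap
    if printf '%s\n' "$tags" | grep -Eq "$(rewrite_pattern "$1")"; then
        echo yes
    else
        echo no
    fi
}

rewrite_pattern 'foo.*baz'    # prints foo[^@]*baz
item_match 'ba.*az'           # yes: stays inside "bar baz"
item_match 'foo.*baz'         # no: would have to cross the delimiter
item_match 'o, ba'            # no: ", " no longer exists in the stream
```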

Finally, we create a stream that contains all of the Tags lines, edit it to contain just the filenames and the new @-delimited tags, apply the user's pattern match to this stream, and then remove everything but the filenames from the stream of matches.

Final code:

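header="$2"
tab=$(printf '\t')    # each header name is followed by a literal tab
# Replace every "." (note the g flag) so it cannot match the @ delimiter.
pattern=$(printf '%s\n' "$3" | sed -e 's/\./[^@]/g')
grep -m 1 "$header:$tab" *.dre | sed -e "s/$header:$tab//" | \
    sed -e 's/, /@@@@@/g' | grep -E "$pattern" | \
    sed -e 's/\([0-9]\{5\}\.dre\):.*/\1/'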

([0-9]\{5\}\.dre is an expression that matches all legal filenames.)

Example output:

00775.dre
00787.dre
00788.dre
00883.dre
00889.dre

(Obviously, the matches can be saved into a variable for further processing; that's what I'm doing here.)
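For anyone who wants to try the pipeline without touching real data, here is a self-contained sketch that runs it against three throwaway files. The function name search_header, the file contents, and the use of mktemp are all assumptions made for the demonstration; -m is not required by POSIX but is supported by GNU and BSD grep. The extra /dev/null argument makes grep print the filename prefix even if only one .dre file exists.

```shell
#!/bin/sh
# search_header DIR HEADER REGEX: list files in DIR whose HEADER line has a
# comma-separated item matching REGEX. (Hypothetical wrapper for this demo.)
search_header() (
    cd "$1" || exit 1
    tab=$(printf '\t')
    pattern=$(printf '%s\n' "$3" | sed -e 's/\./[^@]/g')
    grep -m 1 "$2:$tab" *.dre /dev/null | sed -e "s/$2:$tab//" |
        sed -e 's/, /@@@@@/g' | grep -E "$pattern" |
        sed -e 's/^\([0-9]\{5\}\.dre\):.*/\1/'
)

# Throwaway sample database (made-up contents).
demo_dir=$(mktemp -d)
printf 'Tags:\tfoo, bar baz\n\nBody text\n' > "$demo_dir/00001.dre"
printf 'Tags:\tfoobaz, qux\n\nBody text\n'  > "$demo_dir/00002.dre"
printf 'Tags:\tqux\n\nBody text\n'          > "$demo_dir/00003.dre"

search_header "$demo_dir" Tags 'ba.*az'      # 00001.dre
search_header "$demo_dir" Tags 'foo.*baz'    # 00002.dre only
search_header "$demo_dir" Tags 'o, ba'       # no output
```

One thing to keep in mind: the user's regex is applied to the whole filename:tags line, so a pattern such as dre or 00 also matches the filename part. Anchoring or stripping the filename before the match would be needed to rule that out.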
