Reputation: 304
I want to extract the text inside these XML tags using grep (no XMLStarlet or similar tools, even though they would be easier). I have done this before with grep, but this particular case is a bit more complex. The tags contain a unique alphanumeric identifier with hyphens (a MusicBrainz ID):
<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist>
I have tried this and numerous variations:
grep -Po '(?<=<artist mbid=.*?>).*?(?=</artist>)'
In almost all cases, I get the error "grep: lookbehind assertion is not fixed length". In Perl, IIRC, \K is a solution to that error, but I'm not sure precisely where to put it (I'm a regexp novice, in case you couldn't tell). I've been unsuccessful with simple trial and error.
I've spent a few hours searching SO and Google, and I couldn't find anything similar enough to be of help (possibly I missed something). So, my question is: Using grep, how can I extract the text between tags when those tags include unique alphanumeric identifiers?
Upvotes: 0
Views: 596
Reputation: 56935
To use \K (think "keep out": it drops everything matched so far from the reported match), try
grep -oP '<artist mbid=.*?>\K.*?(?=</artist>)'
That is, put the \K right after the part you want to drop from the match.
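As a quick check, here is a minimal sketch that feeds the sample line from the question through the \K pattern. It assumes a grep built with PCRE support (the -P option, as in GNU grep); the variable names `line` and `name` are only illustrative.

```shell
# Sample line from the question; variable names are illustrative only.
line='<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist>'

# \K discards everything matched before it (the opening tag),
# and the lookahead keeps the closing tag out of the output.
name=$(printf '%s\n' "$line" | grep -oP '<artist mbid=.*?>\K.*?(?=</artist>)')
printf '%s\n' "$name"
```

With real data you would pass the XML file name to grep instead of piping a single line.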
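Otherwise, if you want to stick with a lookbehind, it has to match a fixed-length string (which is why the .*? inside yours triggers the error).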
If you do not expect a '>' to be in an artist's name, and you expect your XML to be well-formed (no mismatching tags), and you don't expect tags to be nested inside an artist, try
grep -Po '(?<=>)[^>]+(?=</artist>)'
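The lookbehind variant can be tried the same way. This sketch again assumes a grep with -P support; the second input line (an album tag) is a made-up example added only to show that non-artist elements are skipped.

```shell
# Artist line from the question plus a hypothetical album line.
xml='<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist>
<album>Abyss</album>'

# (?<=>) looks behind for exactly one character, so its length is fixed.
# [^>]+ then takes the text up to the closing </artist> tag.
artists=$(printf '%s\n' "$xml" | grep -Po '(?<=>)[^>]+(?=</artist>)')
printf '%s\n' "$artists"
```

The same caveats as above apply: a '>' inside a name or tags nested inside the artist element would break this pattern.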
Upvotes: 2