philosophie

Reputation: 304

Bash - grep text inside XML tags that contain unique alphanumeric strings

I want to extract the text inside these XML tags using grep (no XMLStarlet or similar tools, even though they would be easier). I have done this before with grep, but this particular case is a bit more complex. The tags contain a unique alphanumeric identifier with hyphens (a MusicBrainz ID):

<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist>

I have tried this and numerous variations:

grep -Po '(?<=<artist mbid=.*?>).*?(?=</artist>)'

In almost all cases, I get the error "grep: lookbehind assertion is not fixed length". In Perl, if I recall correctly, \K is a solution to that error, but I'm not sure precisely where to put it (I'm a regexp novice, in case you couldn't tell). Simple trial and error has not worked.

I've spent a few hours searching SO and Google, and I couldn't find anything similar enough to help (possibly I missed something). So, my question is: using grep, how can I extract the text between tags when those tags include unique alphanumeric identifiers?

Upvotes: 0

Views: 596

Answers (1)

mathematical.coffee

Reputation: 56935

To use \K (which drops everything matched so far from the reported match), try

grep -oP '<artist mbid=.*?>\K.*?(?=</artist>)'

That is, you put the \K right after the part you want to drop from the match.
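
As a rough sketch of how this can be tried out, the snippet below runs a \K pattern against the sample line from the question. It assumes GNU grep built with PCRE support (the -P option). The function name extract_artists is made up for illustration, and the pattern is a slightly tighter variant of the one above: it uses [^"]* and [^<]* in place of .*? so the match cannot run across tag boundaries.

```shell
# Hypothetical helper; assumes GNU grep with PCRE support (-P).
extract_artists() {
  # \K drops the opening tag from the reported match, so -o prints
  # only the text that sits between the tag and </artist>.
  grep -oP '<artist mbid="[^"]*">\K[^<]*(?=</artist>)'
}

# Sample line taken from the question.
sample='<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist>'

result=$(printf '%s\n' "$sample" | extract_artists)
echo "$result"
```

With the sample line, this should print only the artist name. To process a file instead, redirect it into the function, e.g. extract_artists < file.xml (file name assumed).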

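Otherwise, if you want to keep the lookbehind, it has to be fixed length, which is why a pattern containing .*? inside (?<=...) is rejected. A single '>' is fixed length, so:

If you do not expect a '>' to appear in an artist's name, you expect your XML to be well-formed (no mismatched tags), and you don't expect tags to be nested inside an artist element, try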

grep -Po '(?<=>)[^>]+(?=</artist>)'
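
A small sketch of the fixed-length lookbehind approach on a line holding two artist elements, again assuming GNU grep with -P. The second element (its mbid and name) is a placeholder invented for the example, and [^<>]+ is a slightly stricter variant of [^>]+ that also refuses '<'.

```shell
# Two artists on one line; the second element is a made-up placeholder.
sample='<artist mbid="eaefd603-84c1-4db4-a72b-0cb718a0cc07">Chelsea Wolfe</artist><artist mbid="placeholder-id">Example Artist</artist>'

# (?<=>) is a one-character lookbehind, so PCRE accepts it.
# -o prints each match on its own line.
names=$(printf '%s\n' "$sample" | grep -oP '(?<=>)[^<>]+(?=</artist>)')
printf '%s\n' "$names"
```

Each artist name should come out on its own line. Note that this pattern does not check the opening tag at all, so it would also pick up text from any other element that happens to be followed by </artist>.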

Upvotes: 2
