SubSevn

Reputation: 1028

Regex capture words inside tags

Given an XML document, I'd like to be able to pick out individual key/value pairs from a particular tag:

<aaa>key0:val0 key1:val1 key2:val2</aaa>

I'd like to get back

key0:val0 key1:val1 key2:val2

So far I have (?<=<aaa>).*(?=<\/aaa>)

Which will match everything inside, but as one result.

I also have [^\s][\w]*:[\w]*[^\s] which will also match correctly in groups on this:

key0:val0 key1:val1 key2:val2

But not with the tags. I believe this is an issue with searching for subgroups and I'm not sure how to get around it.

Thanks!

Upvotes: 0

Views: 130

Answers (2)

Jdell64

Reputation: 1

I uploaded this one at parsemarket, and I'm not sure it's what you are looking for, but maybe something like this:

(<aaa>)((\w+:\w+\s)*(\w+:\w+)*)(<\/aaa>)

AFAIK, unless you know how many k:v pairs are in the tags, you can't capture each of them in its own group with a single regex. So, if there are only three, you could do something like this:

<(?:aaa)>(\w+:\w+\s*)+(\w+:\w+\s*)+(\w+:\w+\s*)+<(?:\/aaa)>

But I would think you would want to do some sort of loop with whatever language you are using. Or, as some of the comments suggest, use the parser classes in the language. I've used BeautifulSoup in Python for HTML.
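As a rough sketch of the parser route in Python, here is one way it could look using the standard library's xml.etree.ElementTree instead of BeautifulSoup. The function name pairs_from_tag and the <root> wrapper in the sample document are made up for illustration, and it assumes the pairs are whitespace-separated with a single colon between key and value:

```python
import xml.etree.ElementTree as ET


def pairs_from_tag(xml_text, tag="aaa"):
    """Return (key, value) tuples found in the text of every <tag> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter() walks the whole tree, including the root element itself.
    for element in root.iter(tag):
        # element.text is None for an empty element, hence the fallback.
        for token in (element.text or "").split():
            key, sep, value = token.partition(":")
            if sep:  # ignore tokens that contain no colon
                pairs.append((key, value))
    return pairs


# Hypothetical sample document; the <root> wrapper is an assumption.
doc = "<root><aaa>key0:val0 key1:val1 key2:val2</aaa></root>"
print(pairs_from_tag(doc))
# [('key0', 'val0'), ('key1', 'val1'), ('key2', 'val2')]
```

The loop here is the "some sort of loop" part: the parser finds the tag, and plain string splitting does the rest.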

Upvotes: 0

Andrei Vajna II

Reputation: 4842

You cannot combine the two expressions in the way you want, because you have to match each occurrence of "key:value".

So in what you came up with, (?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>), there are two things to look at: the overall match, which covers everything inside the tags, and the parenthesized group, which matches a single "key:value" occurrence. Because that group is repeated with +, the regex engine keeps only its last repetition instead of each individual occurrence, so the last pair is all you get back from it.

In Python terms, the match object you get after applying your regex gives you matcher.group(0), the whole match, and matcher.group(1), the last repetition of the single ( ) group. The lookarounds do not capture anything.
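To see this in practice, here is a small check with Python's re module, using the pattern quoted above on a sample string I made up with <abc> tags:

```python
import re

# The pattern from above: one repeated capturing group between two lookarounds.
pattern = re.compile(r"(?<=<abc>)([\w]*:[\w]*[\s]*)+(?=<\/abc>)")

# Sample input (an assumption, modelled on the question's data).
matcher = pattern.search("<abc>key0:val0 key1:val1 key2:val2</abc>")

print(pattern.groups)    # 1: only the ( ) group captures
print(matcher.group(0))  # key0:val0 key1:val1 key2:val2
print(matcher.group(1))  # key2:val2 (only the last repetition survives)
```

The first two pairs are matched but never exposed through a group, which is why a second pass over the inner text is needed.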

But what you want is the n occurrences of "key:value". So it's easier to just run the simpler \w+:\w+ regex on the string inside the tags.
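A minimal sketch of that two-step approach in Python, assuming the tag has no attributes and is not nested (extract_pairs is just a name I picked for the example):

```python
import re


def extract_pairs(text, tag="aaa"):
    """Return every key:value token found inside the first <tag>...</tag>."""
    # Step 1: grab the string inside the tags (non-greedy, may span lines).
    name = re.escape(tag)
    inner = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
    if inner is None:
        return []
    # Step 2: run the simple pattern on that string; findall returns all hits.
    return re.findall(r"\w+:\w+", inner.group(1))


print(extract_pairs("<aaa>key0:val0 key1:val1 key2:val2</aaa>"))
# ['key0:val0', 'key1:val1', 'key2:val2']
```

If the document has several such tags, re.finditer in step 1 would let you repeat step 2 for each one.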

Upvotes: 1
