Reputation: 10888
I am trying to get all strings enclosed in <*> by using the following Regex:
Regex regex = new Regex(@"\<(?<name>\S+)\>", RegexOptions.IgnoreCase);
string name = e.Match.Groups["name"].Value;
But in some cases, where I have text like:
<Vendors><Vtitle/> <VSurname/></Vendors>
it returns two strings instead of four, i.e. the above Regex outputs
<Vendors><Vtitle/> //as one string and
<VSurname/></Vendors> //as second string
whereas I am expecting four strings:
<Vendors>
<Vtitle/>
<VSurname/>
</Vendors>
Could you please advise what change I need to make to my Regex?
I tried adding '\b' to specify a word boundary:
new Regex(@"\b\<(?<name>\S+)\>\b", RegexOptions.IgnoreCase);
but that didn't help.
Upvotes: 0
Views: 9124
Reputation: 192457
Your regex is using \S+ as the wildcard. In English, this is "a series of one or more characters, none of which is whitespace". In other words, when the regex <(?<name>\S+)>
is applied to this string: '<Vendors><Vtitle/>', the regex will match the entire string. Angle brackets are non-whitespace, so \S+ consumes them too.
I think what you want is "a series of one or more characters, none of which is a closing angle bracket".
The regex for that is <(?<name>[^>]+)>.
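As a minimal sketch of that pattern applied to the sample text from the question (the Program class and the loop are just scaffolding I've assumed for illustration, not the asker's actual code):

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Sample text taken from the question.
        string input = "<Vendors><Vtitle/> <VSurname/></Vendors>";

        // [^>]+ stops at the first '>', so each tag is matched on its own.
        Regex regex = new Regex(@"<(?<name>[^>]+)>");

        foreach (Match m in regex.Matches(input))
        {
            // m.Value is the whole tag; the "name" group is the text between the brackets.
            Console.WriteLine(m.Value + " -> " + m.Groups["name"].Value);
        }
    }
}
```

This should give four matches: <Vendors>, <Vtitle/>, <VSurname/> and </Vendors>. Note that the "name" group still contains the slash for closing and self-closing tags (e.g. "/Vendors", "Vtitle/"), so you may need to trim it.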
Ahhh, regular expressions. The language designed to look like cartoon swearing.
Upvotes: 4
Reputation: 29143
Regexes are the wrong tool for parsing XML. Try using the System.Xml.Linq (XElement) API.
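A rough sketch of what that could look like, assuming the input is well-formed XML as in the question's sample (the Program class is only illustrative scaffolding):

```csharp
using System;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Sample text taken from the question; it must be well-formed XML,
        // otherwise XElement.Parse throws an XmlException.
        string input = "<Vendors><Vtitle/> <VSurname/></Vendors>";

        XElement root = XElement.Parse(input);

        // DescendantsAndSelf walks the root element and everything nested in it.
        foreach (XElement element in root.DescendantsAndSelf())
        {
            Console.WriteLine(element.Name.LocalName);
        }
    }
}
```

This lists elements (Vendors, Vtitle, VSurname) rather than raw tag strings, so the closing </Vendors> tag does not show up as a separate item. Whether that fits depends on what you need the tags for.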
Upvotes: 6
Reputation: 41378
You'll get most of what you want by using the regex /<([^>]*)>/. (No need to escape the angle brackets, as angle brackets aren't special characters in most regex engines, including the .NET engine.) The regex I provided will also capture trailing whitespace and any attributes on the tag; parsing those things reliably is way, way beyond the scope of a reasonable regex.
However, be aware that if you're trying to parse XML/HTML with a regex, that way lies madness.
Upvotes: 10