Reputation: 45632
I have input like the following:
[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)
I need to extract the strings @suddentwilight and http://www.test.com that appear inside the anchor tags. There may be tags such as [b] or [i] wrapping the actual text; I need to ignore those.
Basically, I need to find each match that starts with an opening [a] tag, then get the string/URL that comes before the closing [/a] tag.
Please suggest.
Upvotes: 0
Views: 771
Reputation: 89043
I don't know C#, but here's a regex:
/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/
This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[/tag2][/tag1][/a] and capture text.
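As a quick check, here is a minimal sketch that runs the pattern above against the sample input from the question. It is written in Python rather than C#, purely as an assumption for illustration; the names LINK_TEXT, sample and link_texts are made up here. The pattern only uses character classes, groups and a lazy quantifier, so it should carry over to other engines with little or no change, but that is worth verifying in your own environment.

```python
import re

# The pattern from the answer, without the /.../ delimiters,
# written as a Python raw string so the backslashes survive.
LINK_TEXT = re.compile(r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]")

# Sample input copied from the question.
sample = (
    "[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] "
    "My POV: Rakhi Sawant hits below the belt & does anything for attention... "
    "[a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] "
    "has maintained the grace/decency :)"
)

# With exactly one capture group, findall returns the captured strings.
link_texts = LINK_TEXT.findall(sample)
print(link_texts)  # ['@suddentwilight', 'http://www.test.com']
```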
To explain:
- /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes, in which case the forward slashes aren't necessary.
- \[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
- [^\]] is an example of a character class, here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
- * and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
- \s is a shorthand for the character class of whitespace characters.
- (?:...) groups the contents into a single unit, so that a suffix such as * applies to the whole group. Unlike (...), it does not capture anything.
- (...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the link text.
- . matches any single character (in most engines, any character except a newline by default).
- *? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite: it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
This will only remove [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:
/\[[^\]]+\]/
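The two passes can be sketched roughly as follows. This is an illustration in Python, not C#, and the helper name extract_link_texts and the constants LINK_TEXT and ANY_TAG are hypothetical names introduced here. The first pattern captures the link text, and the second strips any bracketed tags left inside the capture.

```python
import re

# First pass: capture the text between [a ...] and [/a], skipping
# any tags directly after the opening tag or before the closing one.
LINK_TEXT = re.compile(r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]")

# Second pass: any single [tag], used to scrub tags inside the capture.
ANY_TAG = re.compile(r"\[[^\]]+\]")


def extract_link_texts(text):
    """Hypothetical helper: return the link texts with all [tag]s removed."""
    return [ANY_TAG.sub("", captured) for captured in LINK_TEXT.findall(text)]


print(extract_link_texts('[a href=""]a [b]big[/b] frog[/a]'))  # ['a big frog']
```

Note that the second pass removes anything in square brackets, so link text that legitimately contains a bracketed phrase would lose it as well.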
Upvotes: 3