Tanmoy

Reputation: 45632

regex to fetch string between [a] and [/a] excluding any other tag like [b][/b] that comes in between

I have an input like the following:

[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] My POV: Rakhi Sawant hits below the belt & does anything for attention... [a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] has maintained the grace/decency :)

I need to extract the strings @suddentwilight and http://www.test.com that appear inside the anchor tags. The actual text may be wrapped in other tags such as [b] or [i], which I need to ignore.

Basically, I need a pattern that matches from the opening [a ...] tag and captures the string or URL that comes before the closing [/a] tag.

Please suggest.

Upvotes: 0

Views: 771

Answers (1)

rampion

Reputation: 89043

I don't know C#, but here's a regex:

/\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]/

This will match [a ...][tag1][tag2][...][tagN]text[/tagN]...[/tag2][/tag1][/a] and capture text.
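As a quick check against the input from the question, here is a minimal sketch in Python. Python is used only for illustration and is not the asker's language; the names LINK_TEXT and sample are made up for this example.

```python
import re

# The regex from above, without the /.../ delimiters
# (Python takes the pattern as a raw string).
LINK_TEXT = re.compile(
    r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]"
)

# Input copied from the question.
sample = (
    "[a href=http://twitter.com/suddentwilight][font][b][i]@suddentwilight[/font][/a] "
    "My POV: Rakhi Sawant hits below the belt & does anything for attention... "
    "[a href=http://twitter.com/mallikaLA][b]http://www.test.com[/b][/a] "
    "has maintained the grace/decency :)"
)

# With a single capture group, findall returns the captured text of each match.
links = LINK_TEXT.findall(sample)
print(links)  # ['@suddentwilight', 'http://www.test.com']
```

The leading [font][b][i] and the trailing [/font] or [/b] are consumed by the two (?:...)* groups, so only the text in between ends up in the capture.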

To explain:

  • the /.../ are common regex delimiters (like double quotes for strings). C# may just use strings to initialize regexes - in which case the forward slashes aren't necessary.
  • \[ and \] match a literal [ and ] character. We need to escape them with a backslash since square brackets have a special meaning in regexes.
  • [^\]] is an example of a character class - here meaning any character that is not a close square bracket. The square brackets delimit the character class, the caret (^) denotes negation, and the escaped close square bracket is the character being negated.
  • * and + are suffixes meaning match 0 or more and 1 or more of the previous pattern, respectively. So [^\]]* means match 0 or more of anything except a close square bracket.
  • \s is a shorthand for the character class of whitespace characters
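  • (?:...) is a non-capturing group: it lets you treat the contents as a single unit, so that a suffix such as * applies to the whole group, without saving what it matched.
  • (...) groups like (?:...) does, but also saves the substring that this portion of the regex matches into a variable. This is normally called a capture, since it captures this portion of the string for you to use later. Here, we are using a capture to grab the link text.
  • . matches any single character (in most regex engines, any character except a newline unless a single-line option is set).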
  • *? is a suffix for non-greedy matching. Normally, the * suffix is greedy, and matches as much as it can while still allowing the rest of the pattern to match something. *? is the opposite - it matches as little as it can while still allowing the rest of the pattern to match something. The reason we use *? here instead of * is so that if we have multiple [/a]s on a line, we only go as far as the next one when matching link text.
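The difference between * and *? is easiest to see on a line with two links. The sketch below uses Python and a simplified version of the pattern (the tag-skipping groups are left out); the input line is a hypothetical example, not taken from the question.

```python
import re

# Hypothetical one-line input with two links, made up for this comparison.
line = "[a href=x]first[/a] and [a href=y]second[/a]"

# Non-greedy: the capture stops at the first [/a] that lets the match succeed.
lazy = re.findall(r"\[a\s+[^\]]*\](.*?)\[\/a\]", line)

# Greedy: the capture runs on to the last [/a] on the line.
greedy = re.findall(r"\[a\s+[^\]]*\](.*)\[\/a\]", line)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first[/a] and [a href=y]second']
```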

This will only strip [tag]s that come at the beginning and end of the text. To remove any that come in the middle of the text (like [a href=""]a [b]big[/b] frog[/a]), you'll need to do a second pass on the capture from the first, scrubbing out any text that matches:

/\[[^\]]+\]/
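The two passes can be sketched together as follows, again in Python for illustration only. The function name link_texts and the constants LINK_TEXT and ANY_TAG are hypothetical names chosen for this example.

```python
import re

# First-pass pattern: capture the text between [a ...] and [/a].
LINK_TEXT = re.compile(
    r"\[a\s+[^\]]*\](?:\[[^\]]+\])*(.*?)(?:\[[^\]]+\])*\[\/a\]"
)
# Second-pass pattern: any single [tag].
ANY_TAG = re.compile(r"\[[^\]]+\]")


def link_texts(text):
    """Return the text of each [a ...]...[/a] link with every [tag] removed."""
    # Replace each tag left inside a capture with the empty string.
    return [ANY_TAG.sub("", captured) for captured in LINK_TEXT.findall(text)]


print(link_texts('[a href=""]a [b]big[/b] frog[/a]'))  # ['a big frog']
```

Since the second pass removes every tag in the capture anyway, the two (?:...)* groups in the first pattern become optional once this scrub is in place; they are kept here so the pattern matches the one above.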

Upvotes: 3
