Reputation: 10552

RegEX style for HTML code

Hey all, what would the regex be to extract the description text from the following HTML:

<br/><span class=""synopsis-view-synopsis"">America's justice system comes under indictment in director <a href='/people/1035' class='actor' style='font-weight:bold'>Norman Jewison</a>'s trenchant film starring <a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a> as upstanding attorney Arthur Kirkland. A hard-line -- and tainted -- judge (<a href='/people/1034' class='actor' style='font-weight:bold'>John Forsythe</a>) stands accused of rape, and Kirkland (<a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a>) has to defend him. Kirkland has a history with the judge, who jailed one of the lawyer's clients on a technicality. When the judge confesses his guilt, Kirkland faces an ethical and legal quandary. </span>

I've tried this:

regex = New System.Text.RegularExpressions.Regex("(?<=""synopsis-view-synopsis""\>)([^<\/span><]+)")

But that only seems to get the first part of the description: Americ

Any help would be great! :o)

David

Upvotes: 0

Answers (3)

Alan Moore

Reputation: 75252

I don't see any need for lookaheads or lookbehinds here; just match the whole <span> element and use a capturing group to extract its content. Assuming there will never be any <span> elements inside the one you're matching, this should be all you need:

Regex rgx = new Regex(
    @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

foreach (Match m in rgx.Matches(s0))
{
  Console.WriteLine(m.Groups[1].Value);
}

Also, [^<\/span><]+ doesn't do what you probably think it does. What you've got there is a character class that matches any one character except <, /, s, p, a, n, or >. You may have been trying for this:

(?:(?!</span>).)+

...which matches one character at a time, after the lookahead confirms that the character isn't the beginning of the sequence </span>. It's a valid technique, but (as with the lookarounds) I don't think you need anything so fancy here.
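
To make the difference concrete, here is a minimal sketch that runs both patterns against a shortened version of the sample. The input string, class name and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class CharClassVsLookahead
{
    static void Main()
    {
        // Shortened, hypothetical sample input.
        string s0 = @"<span class=""synopsis-view-synopsis"">America's <a href='/people/1035'>Norman Jewison</a> film</span>";

        // Character class: stops at the first <, /, s, p, a, n or >.
        Match m1 = Regex.Match(s0, @"(?<=""synopsis-view-synopsis"">)[^<\/span><]+");
        Console.WriteLine(m1.Value);   // Americ

        // Lookahead before each dot: stops only where the sequence </span> begins.
        Match m2 = Regex.Match(s0, @"(?<=""synopsis-view-synopsis"">)(?:(?!</span>).)+");
        Console.WriteLine(m2.Value);   // America's <a href='/people/1035'>Norman Jewison</a> film
    }
}
```

The first match ends at the lowercase "a" in "America" because "a" is one of the excluded characters; the second runs through the nested <a> element and ends just before </span>.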

Upvotes: 1

bw_üezi

Reputation: 4574

In .NET there are separate methods for a single match and for all matches:

re.Match(str);    // first match of regex 're' in string 'str'
re.Matches(str);  // all matches of regex 're' in string 'str'
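
As a small illustration of the two calls, here is a sketch using a throwaway pattern and input string (both are assumptions, not taken from the question):

```csharp
using System;
using System.Text.RegularExpressions;

class MatchVsMatches
{
    static void Main()
    {
        Regex re = new Regex(@"\d+");
        string str = "10 apples, 20 pears";

        // Match returns only the first match.
        Match first = re.Match(str);
        Console.WriteLine(first.Value);   // 10

        // Matches returns every non-overlapping match.
        MatchCollection all = re.Matches(str);
        Console.WriteLine(all.Count);     // 2
        foreach (Match m in all)
            Console.WriteLine(m.Value);   // 10, then 20
    }
}
```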

Update

Explanation of the regex:

(?<=regex) is a positive lookbehind
(?!regex) is a negative lookahead
.+ then matches the text that follows the lookbehind

Raw Match Pattern:

(?<=""synopsis-view-synopsis""\>).+(?!</span>)

C#.NET Code Example:

using System;
using System.Text.RegularExpressions;
namespace myapp
{
  class Class1
    {
      static void Main(string[] args)
        {
          // Verbatim strings (@"...") escape a double quote by doubling it.
          String sourcestring =
            @"<br/><span class=""synopsis-view-synopsis"">America's justice... </span>" + "\n" +
            @"<br/><span class=""synopsis-view-synopsis"">Canada's justice... </span>";

          Regex re = new Regex(@"(?<=""synopsis-view-synopsis""\>).+(?!</span>)");
          MatchCollection mc = re.Matches(sourcestring);
          int mIdx=0;
          foreach (Match m in mc)
           {
            for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
              {
                Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
              }
            mIdx++;
          }
        }
    }
}

Matches Found:

[0][0] = America's justice... </span>
[1][0] = Canada's justice... </span>
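
Note that the matches listed above still end in </span>: .+ is greedy and runs to the end of the line, where the negative lookahead succeeds trivially. If the closing tag should be left out, one possible variation is a lazy quantifier followed by a positive lookahead. This is a sketch, not part of the pattern above, and the class and variable names are made up:

```csharp
using System;
using System.Text.RegularExpressions;

class ExcludeClosingTag
{
    static void Main()
    {
        // Hypothetical two-line input; verbatim strings double the quotes.
        string source =
            @"<br/><span class=""synopsis-view-synopsis"">America's justice... </span>" + "\n" +
            @"<br/><span class=""synopsis-view-synopsis"">Canada's justice... </span>";

        // .+? is lazy, so it stops at the first position where </span> follows.
        Regex re = new Regex(@"(?<=""synopsis-view-synopsis"">).+?(?=</span>)");

        foreach (Match m in re.Matches(source))
            Console.WriteLine("[{0}]", m.Value);
        // [America's justice... ]
        // [Canada's justice... ]
    }
}
```

Without RegexOptions.Singleline the dot does not cross line breaks, so a synopsis that spans several lines would need that option as well.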

Upvotes: 0

Jakub Hampl

Reputation: 40563

(?=""synopsis-view-synopsis""\>).+(?!<\/span>)

Should probably work. Try using an HTML parser instead!
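
For the parser route, one option is the third-party HtmlAgilityPack NuGet package. That choice is an assumption of this sketch; nothing in the question names a specific parser, and the shortened input string is made up:

```csharp
using System;
using HtmlAgilityPack;   // third-party NuGet package, an assumption of this sketch

class ParserSketch
{
    static void Main()
    {
        // Shortened, hypothetical sample input.
        string html = @"<br/><span class=""synopsis-view-synopsis"">A film by <a href='/people/1035' class='actor'>Norman Jewison</a>.</span>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: the first <span> whose class attribute is exactly synopsis-view-synopsis.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@class='synopsis-view-synopsis']");

        if (span != null)
        {
            // InnerText drops the nested <a> tags; DeEntitize decodes entities such as &amp;.
            Console.WriteLine(HtmlEntity.DeEntitize(span.InnerText));
        }
    }
}
```

Unlike the regex versions, this keeps working when the span contains nested elements or its attributes are reordered.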

Upvotes: 0
