Reputation: 10552
Hey all, what would the regex be for the following:
<br/><span class=""synopsis-view-synopsis"">America's justice system comes under indictment in director <a href='/people/1035' class='actor' style='font-weight:bold'>Norman Jewison</a>'s trenchant film starring <a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a> as upstanding attorney Arthur Kirkland. A hard-line -- and tainted -- judge (<a href='/people/1034' class='actor' style='font-weight:bold'>John Forsythe</a>) stands accused of rape, and Kirkland (<a href='/people/1028' class='actor' style='font-weight:bold'>Al Pacino</a>) has to defend him. Kirkland has a history with the judge, who jailed one of the lawyer's clients on a technicality. When the judge confesses his guilt, Kirkland faces an ethical and legal quandary. </span>
I've tried this:
regex = New System.Text.RegularExpressions.Regex("(?<=""synopsis-view-synopsis""\>)([^<\/span><]+)")
But that only seems to get the first part of the description: Americ
Any help would be great! :o)
David
Upvotes: 0
Views: 492
Reputation: 75252
I don't see any need for lookaheads or lookbehinds here; just match the whole <span> element and use a capturing group to extract its content. Assuming there will never be any <span> elements inside the one you're matching, this should be all you need:
Regex rgx = new Regex(
    @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match m in rgx.Matches(s0))
{
    Console.WriteLine(m.Groups[1].Value);
}
Also, [^<\/span><]+ doesn't do what you probably think it does. What you've got there is a character class that matches any one character except <, /, s, p, a, n, or >. You may have been trying for this:
(?:(?!</span>).)+
...which matches one character at a time, after the lookahead confirms that the character isn't the beginning of the sequence </span>. It's a valid technique, but (as with the lookarounds) I don't think you need anything so fancy here.
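To see the difference between the two patterns, here is a minimal sketch. The class name, method names, and the shortened sample string are made up for illustration; only the two patterns come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Hypothetical sample input, shortened from the HTML in the question.
    public const string Sample =
        @"<span class=""synopsis-view-synopsis"">America's <a href='/people/1028'>Al Pacino</a> film. </span>";

    // Character class from the question: stops at the first <, /, s, p, a, n or >.
    public static string WithCharClass(string html) =>
        Regex.Match(html, @"(?<=""synopsis-view-synopsis"">)([^<\/span><]+)").Value;

    // Lookahead-guarded dot: keeps consuming until the sequence </span> begins.
    public static string WithTemperedDot(string html) =>
        Regex.Match(html, @"(?<=""synopsis-view-synopsis"">)(?:(?!</span>).)+").Value;

    public static void Main()
    {
        Console.WriteLine(WithCharClass(Sample));   // Americ
        Console.WriteLine(WithTemperedDot(Sample)); // everything up to, but not including, </span>
    }
}
```

The first call stops at the second "a" in "America's", which is exactly the truncation described in the question.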
Upvotes: 1
Reputation: 4574
In .NET there are separate methods for finding the first match and for finding all matches:
re.Match(str);   // first match of regex 're' in string 'str'
re.Matches(str); // all matches of regex 're' in string 'str'
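As a rough illustration of the two calls, the sketch below runs both against a made-up string containing two spans. The class, the helper names, and the sample text are assumptions for the example, and the capturing-group pattern is just one pattern that fits this markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVsMatchesDemo
{
    // Hypothetical input with two synopsis spans.
    public const string Sample =
        @"<span class=""synopsis-view-synopsis"">First</span>" +
        @"<span class=""synopsis-view-synopsis"">Second</span>";

    private static readonly Regex Re = new Regex(
        @"<span\s+class=""synopsis-view-synopsis"">(.*?)</span>",
        RegexOptions.Singleline);

    // Regex.Match returns only the first match.
    public static string First(string html) => Re.Match(html).Groups[1].Value;

    // Regex.Matches returns a MatchCollection holding every match.
    public static int Count(string html) => Re.Matches(html).Count;

    public static void Main()
    {
        Console.WriteLine(First(Sample)); // First
        Console.WriteLine(Count(Sample)); // 2
    }
}
```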
Update
Explanation of the regex:
(?<=regex) is a positive lookbehind.
(?!regex) is a negative lookahead.
.+ matches anything between the lookarounds.
Raw Match Pattern:
(?<=""synopsis-view-synopsis""\>).+(?!</span>)
C#.NET Code Example:
using System;
using System.Text.RegularExpressions;
namespace myapp
{
    class Class1
    {
        static void Main(string[] args)
        {
            // Verbatim strings: "" stands for one literal double quote.
            string sourcestring =
                @"<br/><span class=""synopsis-view-synopsis"">America's justice... </span>" + "\n" +
                @"<br/><span class=""synopsis-view-synopsis"">Canada's justice... </span>";
            Regex re = new Regex(@"(?<=""synopsis-view-synopsis""\>).+(?!</span>)");
            MatchCollection mc = re.Matches(sourcestring);
            int mIdx = 0;
            foreach (Match m in mc)
            {
                // Print every group of every match; group 0 is the whole match.
                for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
                {
                    Console.WriteLine("[{0}][{1}] = {2}", mIdx, re.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
                }
                mIdx++;
            }
        }
    }
}
Matches Found:
[0][0] = America's justice... </span>
[1][0] = Canada's justice... </span>
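Note that each match in the output above still ends with </span>: the greedy .+ runs to the end of the line before the negative lookahead is checked. If the closing tag should be left out, one possible variant (a sketch, not part of the answer's original code) is a lazy quantifier followed by a positive lookahead. The class and method names here are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyLookaheadDemo
{
    // Same two-line sample as in the example above.
    public const string Sample =
        @"<br/><span class=""synopsis-view-synopsis"">America's justice... </span>" + "\n" +
        @"<br/><span class=""synopsis-view-synopsis"">Canada's justice... </span>";

    // .+? expands one character at a time and stops as soon as </span> follows.
    private static readonly Regex Re = new Regex(
        @"(?<=""synopsis-view-synopsis"">).+?(?=</span>)",
        RegexOptions.Singleline);

    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Re.Matches(html))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string s in Extract(Sample))
        {
            Console.WriteLine("[" + s + "]");
        }
    }
}
```

RegexOptions.Singleline is included so that a synopsis containing line breaks would still be matched; it makes no difference for this sample.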
Upvotes: 0
Reputation: 40563
(?=""synopsis-view-synopsis""\>).+(?!<\/span>)
Should probably work. Try using an HTML parser instead!
Upvotes: 0