Reputation: 1395
I'm trying to retrieve all text between <td> and </td>, but I only get the first match in my collection. Do I need a * or something? Here is my code.
string input = @"<tr class=""row0""><td>09/08/2013</td><td><a href=""/teams/nfl/new-england-patriots/results"">New England Patriots</a></td><td><a href=""/boxscore/2013090803"">L, 23-21</a></td><td align=""center"">0-1-0</td><td align=""right"">65,519</td></tr>";
string pattern = @"(?<=<td>)[^>]*(?=</td>)";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
try
{
listBoxControl1.Items.Add(matches.ToString());
}
catch { }
}
Upvotes: 7
Views: 28776
Reputation: 32797
HTML (unlike XHTML) is not a strict format, so in many cases regex is not suitable for parsing such a complex grammar. You need to use a parser, for example HtmlAgilityPack.
You can use this code to retrieve the cells with HtmlAgilityPack:
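// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// Select every <td> element and keep only its text content.
var tdList = doc.DocumentNode.SelectNodes("//td")
    .Select(p => p.InnerText)
    .ToList();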
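Since the markup in the question is already held in a string, a variation along these lines may be closer to what is needed. This is only a sketch and assumes the HtmlAgilityPack NuGet package is referenced; `CellReader` and `GetCellTexts` are made-up names for illustration. `SelectNodes` returns null when nothing matches, hence the guard.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class CellReader
{
    // Hypothetical helper: returns the text of every <td> in an HTML string.
    public static List<string> GetCellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a stream

        var nodes = doc.DocumentNode.SelectNodes("//td");
        if (nodes == null)
        {
            // No <td> elements were found.
            return new List<string>();
        }

        // InnerText drops nested tags such as <a> and keeps their text.
        return nodes.Select(td => td.InnerText.Trim()).ToList();
    }
}
```

Calling `CellReader.GetCellTexts(input)` with the string from the question should then give one entry per cell, including the cells that carry an `align` attribute.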
Upvotes: 4
Reputation: 619
I found a solution here http://geekcoder.org/js-extract-hashtags-from-text/ from Nicolas Durand - it seems to work pretty well:
#[^ :\n\t\.,\?\/’'!]+
Best regards, Phil
Upvotes: 0
Reputation: 116
Use the following regular expression, which captures the content of each cell in a named group:
string input = "<tr class=\"row0\"><td>09/08/2013</td><td><a href=\"/teams/nfl/new-england-patriots/results\">New England Patriots</a></td><td><a href=\"/boxscore/2013090803\">L, 23-21</a></td><td align=\"center\">0-1-0</td><td align=\"right\">65,519</td></tr>";
string pattern = "(<td>)(?<td_inner>.*?)(</td>)";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    // Print only the named group, i.e. the text between <td> and </td>.
    Console.WriteLine(match.Groups["td_inner"].Value);
}
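Note that a pattern starting with a literal `<td>` matches only cells without attributes, so `<td align="center">` and `<td align="right">` are skipped, and the captured text still contains the nested `<a>` tags. A minimal sketch that allows attributes and strips inner tags might look like the following; `TdExtractor` and `ExtractCells` are hypothetical names, and a regex like this remains fragile on real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TdExtractor
{
    // <td\b[^>]*> accepts optional attributes on the opening tag;
    // (?<inner>.*?) lazily captures everything up to the nearest </td>.
    private static readonly Regex CellPattern = new Regex(
        @"<td\b[^>]*>(?<inner>.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns the visible text of every <td> cell.
    public static List<string> ExtractCells(string html)
    {
        var cells = new List<string>();
        foreach (Match match in CellPattern.Matches(html))
        {
            // Remove nested tags such as <a ...> so only the text remains.
            string text = Regex.Replace(match.Groups["inner"].Value, "<[^>]+>", "");
            cells.Add(text.Trim());
        }
        return cells;
    }
}
```

With the input string from the question, `TdExtractor.ExtractCells(input)` yields five entries: `09/08/2013`, `New England Patriots`, `L, 23-21`, `0-1-0` and `65,519`.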
Upvotes: 9