Reputation: 1392
I have the following HTML code:
<td class="actual">106.2% </td>
I extract the number from it in two steps:
Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
Regex.Match(m.Groups[1].Value, @"-?\d+.\d+").Value
The code above gives me what I want: 106.2.
The problem is that sometimes the HTML can be a little different, like this:
<td class="actual"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
In this last case I can only get 107.2, but I would like to get 106.4. Is there some regular expression trick to say that I want the second number in the string and not the first?
Upvotes: 0
Views: 349
Reputation: 626748
Whenever your HTML comes from different providers, or your current provider has several CMSs that use different HTML formatting styles, it is not safe to rely on regex.
I suggest an HtmlAgilityPack based solution:
public string getCleanHtml(string html)
{
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}
And then:
var txt = "<td class=\"actual\">106.2% </td>";
var clean = getCleanHtml(txt);
txt = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
clean = getCleanHtml(txt);
Result: "106.2% " and "106.4%"
You do not have to worry about nested formatting tags or XML/HTML entity references.
If your text is a substring of the clean HTML string, then you can use Regex or any other string manipulation methods.
UPDATE:
You seem to need the node values from <td> tags. Here is a handy method for you:
private List<string> GetTextFromHtmlTag(string html, string tag)
{
var result = new List<string>();
HtmlAgilityPack.HtmlDocument hap;
Uri uriResult;
if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
{ // html is a URL
var doc = new HtmlAgilityPack.HtmlWeb();
hap = doc.Load(uriResult.AbsoluteUri);
}
else
{ // html is a string
hap = new HtmlAgilityPack.HtmlDocument();
hap.LoadHtml(html);
}
var nodes = hap.DocumentNode.ChildNodes.Where(p => p.Name.ToLower() == tag.ToLower() && p.GetAttributeValue("class", string.Empty) == "previous"); // SelectNodes("//"+tag);
if (nodes != null)
foreach (var node in nodes)
result.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(node.InnerText));
return result;
}
You can call it like this:
var html = "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
var res = GetTextFromHtmlTag(html, "td");
If you need to get only specific tags,
If you have texts with a number inside, and you need just the number, you can use a regex for that:
var rx = new Regex(@"[+-]?\d*\.?\d+"); // Matches "-1.23", "+5", ".677"
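As a rough illustration of how that pattern could be applied to the cleaned text, here is a minimal sketch. The NumberExtractor class and FirstNumber method are names made up for this example, and the input is assumed to be plain text such as the string returned by getCleanHtml:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Same pattern as above: optional sign, optional integer part, optional dot, digits.
    private static readonly Regex NumberRx = new Regex(@"[+-]?\d*\.?\d+");

    // Returns the first number found in the text, or null when there is none.
    public static double? FirstNumber(string text)
    {
        Match m = NumberRx.Match(text);
        if (!m.Success) return null;
        // Parse with the invariant culture so "." is always the decimal separator.
        return double.Parse(m.Value, CultureInfo.InvariantCulture);
    }
}
```

Because the tags and their attributes are already gone at this point, the "Revised From 107.2%" title text can no longer be picked up by mistake.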
Upvotes: 2
Reputation: 1101
string html = @"<td class=""actual""><span class=""revised worse"" title=""Revised From 107.2%"">106.4%</span></td>
<td class=""actual"">106.2% </td>";
string pattern = @"<td\s+class=""actual"">.*(?<=>)(.+?)(?=</).*?</td>";
foreach (Match match in Regex.Matches(html, pattern))
{
Console.WriteLine(match.Groups[1].Value);
}
I have changed the regex as you asked. The output is:
106.4%
106.2%
Upvotes: 1
Reputation: 1392
I want to share the solution I have found for my problem.
So, I can have HTML tags like the following:
<td class="previous"><span class="revised worse" title="Revised From 1.3">0.9</span></td>
<td class="previous"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>
Or simpler:
<td class="previous">51.4</td>
First, I take the entire line with the following code:
MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
And second, I use the following code to extract the numbers only:
foreach (Match m in mPrevious)
{
if (m.Groups[1].Value.Contains("span"))
{
string stringtemp = Regex.Match(m.Groups[1].Value, "-?\\d+.\\d+.\">-?\\d+.\\d+|-?\\d+.\\d+\">-?\\d+.\\d+|-?\\d+.\">-?\\d+|-?\\d+\">-?\\d+").Value;
int indextemp = stringtemp.IndexOf(">");
if (indextemp <= 0) break;
lPrevious.Add(stringtemp.Remove(0, indextemp + 1));
}
else lPrevious.Add(Regex.Match(m.Groups[1].Value, @"-?\d+.\d+|-?\d+").Value);
}
First I check whether there is a SPAN tag. If there is, I match the two numbers together (the regular expression covers several different possibilities), find the ">" character that separates them, and remove everything up to and including it. If there is no SPAN tag, I match the number directly.
It works perfectly.
Thank you all for the support and quick answers.
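For comparison, the same two-step idea can probably be sketched more compactly by deleting the tags (and with them the title attribute) before looking for the number. This is only a sketch under assumptions: the PreviousCells class and Numbers method are hypothetical names, and the tag-stripping pattern assumes that no attribute value contains a ">" character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PreviousCells
{
    // Collects one number per <td class="previous"> cell found in the HTML.
    public static List<string> Numbers(string html)
    {
        var result = new List<string>();
        MatchCollection cells = Regex.Matches(
            html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
        foreach (Match cell in cells)
        {
            // Drop every tag, attributes included, so "Revised From 107.2%" disappears.
            string text = Regex.Replace(cell.Groups[1].Value, "<[^>]+>", "");
            // Escaped dot: an integer with an optional fractional part.
            Match number = Regex.Match(text, @"-?\d+(\.\d+)?");
            if (number.Success) result.Add(number.Value);
        }
        return result;
    }
}
```

A cell without a number is simply skipped here, so one unusual cell does not stop the remaining cells from being read.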
Upvotes: 1
Reputation: 34421
Try the XML approach (LINQ to XML):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication34
{
class Program
{
static void Main(string[] args)
{
string input = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
XElement element = XElement.Parse(input);
string value = element.Descendants("span").Select(x => (string)x).FirstOrDefault();
}
}
}
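The snippet above returns null when the cell has no span. A small variation that may cover both layouts is to read the Value of the td element itself, since Value concatenates the text of all descendant nodes and ignores attributes. The CellText class and Read method are illustrative names, and the sketch assumes the fragment is well-formed XML (XElement.Parse throws an XmlException otherwise, for example on HTML-only entities such as &nbsp;):

```csharp
using System;
using System.Xml.Linq;

public static class CellText
{
    // Assumes tdFragment is a single well-formed XML element.
    public static string Read(string tdFragment)
    {
        XElement td = XElement.Parse(tdFragment);
        // Value joins all descendant text nodes; attribute values such as title are not included.
        return td.Value.Trim();
    }
}
```

Called with either of the two td fragments from the question, this yields the cell text with the percent sign still attached, which can then be parsed or matched further.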
Upvotes: 1