f4d0

Reputation: 1392

C# Regular Expressions - Get Second Number, not First

I have the following HTML code:

<td class="actual">106.2% </td> 

I extract the number in two steps:

Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
Regex.Match(m.Groups[1].Value, @"-?\d+.\d+").Value

The code above gives me what I want: 106.2.

The problem is that sometimes the HTML can be a little different, like this:

<td class="actual"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>

In this case I can only get 107.2, but I would like to get 106.4. Is there a regular expression trick to say that I want the second number in the string rather than the first?

Upvotes: 0

Views: 349

Answers (4)

Wiktor Stribiżew

Reputation: 626748

Whenever your HTML comes from different providers, or your current provider uses several CMSs with different HTML formatting styles, it is not safe to rely on regex.

I suggest an HtmlAgilityPack based solution:

public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

And then:

var txt = "<td class=\"actual\">106.2% </td>";
var clean = getCleanHtml(txt);
txt = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
clean = getCleanHtml(txt);

Result: "106.2% " and "106.4%"

You do not have to worry about nested formatting tags or XML/HTML entity references.

If your text is a substring of the clean HTML string, then you can use Regex or any other string manipulation methods.

UPDATE:

You seem to need the node values from <td> tags. Here is a handy method for you:

private List<string> GetTextFromHtmlTag(string html, string tag)
{
    var result = new List<string>();
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && uriResult.Scheme == Uri.UriSchemeHttp)
    { // html is a URL
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Keep only the requested tags whose class attribute is "previous".
    // Note: ChildNodes inspects top-level nodes only; use Descendants() for nested markup.
    var nodes = hap.DocumentNode.ChildNodes.Where(p =>
        string.Equals(p.Name, tag, StringComparison.OrdinalIgnoreCase) &&
        p.GetAttributeValue("class", string.Empty) == "previous");
    foreach (var node in nodes)
        result.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(node.InnerText));
    return result;
}

You can call it like this:

var html = "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
var res = GetTextFromHtmlTag(html, "td");


If you need to get only specific tags,

If you have texts with a number inside, and you need just the number, you can use a regex for that:

var rx = new Regex(@"[+-]?\d*\.?\d+"); // Matches "-1.23", "+5", ".677"

See demo
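
As a rough sketch of how that regex could be applied to the cleaned text, the following pulls the first number out of a string such as "106.4%" and parses it with the invariant culture. The names NumberExtractor and TryExtractNumber are made up for illustration; only Regex and double.TryParse come from the framework.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Same pattern as above: optional sign, optional integer part, optional dot, digits.
    private static readonly Regex NumberRx = new Regex(@"[+-]?\d*\.?\d+");

    // Hypothetical helper: returns true and the parsed value if the text contains a number.
    public static bool TryExtractNumber(string cleanText, out double value)
    {
        value = 0;
        Match m = NumberRx.Match(cleanText ?? string.Empty);
        return m.Success &&
               double.TryParse(m.Value, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }
}
```

Calling NumberExtractor.TryExtractNumber on the cleaned cell text leaves the percent sign and surrounding whitespace behind, and using CultureInfo.InvariantCulture keeps "." as the decimal separator regardless of the machine's locale.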

Upvotes: 2

Sky Fang

Reputation: 1101

string html = @"<td class=""actual""><span class=""revised worse"" title=""Revised From 107.2%"">106.4%</span></td>
<td class=""actual"">106.2% </td>";
string pattern = @"<td\s+class=""actual"">.*(?<=>)(.+?)(?=</).*?</td>";
foreach (Match match in Regex.Matches(html, pattern))
{
    Console.WriteLine(match.Groups[1].Value);
}

I have changed the regex as you wished. The output is:

106.4%
106.2%

Upvotes: 1

f4d0

Reputation: 1392

I want to share the solution I have found for my problem.

So, I can have HTML tags like the following:

<td class="previous"><span class="revised worse" title="Revised From 1.3">0.9</span></td>
<td class="previous"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>

Or simpler:

<td class="previous">51.4</td>

First, I take the entire cell with the following code:

MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);

And second, I use the following code to extract the numbers only:

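foreach (Match m in mPrevious)
{
    string cell = m.Groups[1].Value;
    if (cell.Contains("span"))
    {
        // Match the old value, the closing "> of the title attribute, and the new value together.
        string stringtemp = Regex.Match(cell, "-?\\d+.\\d+.\">-?\\d+.\\d+|-?\\d+.\\d+\">-?\\d+.\\d+|-?\\d+.\">-?\\d+|-?\\d+\">-?\\d+").Value;
        int indextemp = stringtemp.IndexOf(">");
        if (indextemp <= 0) continue; // no match in this cell: skip it and keep processing the rest
        // Drop everything up to and including '>' so only the new value remains.
        lPrevious.Add(stringtemp.Remove(0, indextemp + 1));
    }
    else lPrevious.Add(Regex.Match(cell, @"-?\d+.\d+|-?\d+").Value);
}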

First I check whether there is a SPAN tag. If there is, I match the two numbers together (the regular expression covers the different possible formats), then find the ">" character that separates the old value from the new one and remove everything up to and including it.

It works perfectly.

Thank you all for the support and quick answers.
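
A possibly simpler variant of the same idea, offered only as a sketch: in the samples above the wanted value is always the last number inside the cell, so one could collect every number in the captured cell content and keep the last one. CellParser and LastNumber are hypothetical names, and the "last number wins" assumption holds for the samples shown but is not guaranteed for other markup.

```csharp
using System.Text.RegularExpressions;

public static class CellParser
{
    // Matches integers and decimals with an optional leading minus, e.g. "0.9", "-3", "106.4".
    private static readonly Regex NumberRx = new Regex(@"-?\d+(?:\.\d+)?");

    // Hypothetical helper: returns the last number found in the cell content, or null if none.
    public static string LastNumber(string cellContent)
    {
        MatchCollection matches = NumberRx.Matches(cellContent);
        return matches.Count == 0 ? null : matches[matches.Count - 1].Value;
    }
}
```

With this, lPrevious.Add(CellParser.LastNumber(m.Groups[1].Value)) would handle both the plain cell and the cell with a SPAN, at the cost of being fooled by any digits that appear after the value (for example in a later attribute).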

Upvotes: 1

jdweng

Reputation: 34421

Try the XML approach:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication34
{
    class Program
    {

        static void Main(string[] args)
        {
            string input = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";

            XElement element = XElement.Parse(input);

            string value = element.Descendants("span").Select(x => (string)x).FirstOrDefault();

        }

    }

}
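
If the cell is well-formed XML, a slightly more general sketch is to read the element's Value, which concatenates all descendant text and so covers both the plain cell and the cell with a nested span. XmlCell and CellText are hypothetical names. XElement.Parse throws an XmlException on markup that is not well-formed, which is common in real HTML, so this only suits inputs like the ones in the question.

```csharp
using System.Xml.Linq;

public static class XmlCell
{
    // Hypothetical helper: returns the trimmed text content of a single well-formed element.
    public static string CellText(string xml)
    {
        XElement element = XElement.Parse(xml);
        // Value concatenates the text of the element and all of its descendants;
        // attribute values such as title="Revised From 107.2%" are not included.
        return element.Value.Trim();
    }
}
```

Because attribute values never appear in Value, the "Revised From" number in the title attribute is skipped without any extra filtering.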

Upvotes: 1
