Leviathan

Reputation: 31

How to extract one word or some words from an HTML page C#

I'm trying to extract one word from an HTML page. For example, there are two textboxes (1 and 2). I want to enter a Stack Overflow question ID in textbox1 and get the "asked" value in textbox2. For example, if I enter 36 in textbox1, this should give me "9 years, 4 months ago" in textbox2.

WebClient webpage = new WebClient();
String html = webpage.DownloadString("https://stackoverflow.com/questions/" + textBox1.Text);
MatchCollection match = Regex.Matches(html, FILTERHERE, RegexOptions.Singleline);

The problem is that I don't know how to filter my output (FILTERHERE). Also, how can I send my output to textbox2?

Upvotes: 1

Views: 391

Answers (2)

Daniel Manta

Reputation: 6683

With HtmlAgilityPack.

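// Requires the HtmlAgilityPack package: using HtmlAgilityPack;
string url = "https://stackoverflow.com/questions/";
var web = new HtmlWeb();
var doc = web.Load(url + textBox1.Text); // the text is "36"
// Select the <b> in the cell that follows the "asked" label inside #qinfo
var tag = doc.DocumentNode.SelectSingleNode("//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]/following-sibling::td//b");
// SelectSingleNode returns null when nothing matches, so guard before reading InnerText
textBox2.Text = tag?.InnerText ?? "not-found";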

If you don't know XPath, there are browser extensions for Chrome and Firefox that get the XPath of any HTML tag for you (I personally write them manually to make them less sensitive to changes in page structure).
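
To see what that XPath expression selects without fetching the page, here is a minimal sketch that evaluates it against a small hard-coded fragment. Note the assumptions: the fragment in SampleHtml is a hypothetical, simplified stand-in for the page markup (the real structure may differ and can change), and it uses System.Xml.XPath from the standard library instead of HtmlAgilityPack, which only works because the sample is well-formed XML. Real pages usually are not, which is why HtmlAgilityPack is used above. The names XPathSketch, SampleHtml, AskedXPath and GetAsked are made up for illustration.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathSketch
{
    // Hypothetical, simplified fragment; not the real page markup.
    public const string SampleHtml =
        "<div id='qinfo'><table>" +
        "<tr><td><p class='label-key'>asked</p></td>" +
        "<td><p class='label-key'><b>9 years, 4 months ago</b></p></td></tr>" +
        "<tr><td><p class='label-key'>viewed</p></td>" +
        "<td><p class='label-key'><b>391 times</b></p></td></tr>" +
        "</table></div>";

    // Same expression as in the answer: find the cell labelled "asked",
    // step to the cell that follows it, and take the <b> inside.
    public const string AskedXPath =
        "//*[@id='qinfo']//td[./p[@class='label-key' and text()='asked']]" +
        "/following-sibling::td//b";

    public static string GetAsked(string markup)
    {
        // XDocument.Parse throws on markup that is not well-formed XML.
        XDocument doc = XDocument.Parse(markup);
        XElement node = doc.XPathSelectElement(AskedXPath);
        // XPathSelectElement returns null when nothing matches.
        return node?.Value ?? "not-found";
    }

    public static void Main()
    {
        Console.WriteLine(GetAsked(SampleHtml)); // prints "9 years, 4 months ago"
    }
}
```

Because the expression keys on the "asked" label text rather than on a row position, the "viewed" row in the sample is ignored, which is the kind of robustness hand-written XPath is meant to give.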

Upvotes: 3

gembird

Reputation: 14053

In a Windows Forms application, the WebBrowser control can be used; it wraps the mshtml library and exposes a managed HTML DOM. Here is an example of a function that retrieves the "asked" text:

private static string GetAskedText(HtmlDocument doc)
{
    if (doc == null)
        return "document-null";
    IEnumerable<mshtml.HTMLDivElement> divs = doc.GetElementsByTagName("div")
        .OfType<HtmlElement>()
        .Select(e => e.DomElement as mshtml.HTMLDivElement);
    foreach (var div in divs)
    {
        if (string.IsNullOrWhiteSpace(div?.className))
            continue;
        if (div.className.Trim().ToLower() != "user-info")
            continue;
        var spans = div.getElementsByTagName("span").OfType<mshtml.HTMLSpanElement>();
        foreach (var span in spans)
        {
            if (string.IsNullOrWhiteSpace(span?.className))
                continue;
            if (span.className == "relativetime")
            {
                return span.innerText;
            }
        }
    }

    return "not-found";
}

A complete example as a Windows Forms application can be downloaded from my Dropbox.

Upvotes: 2
