Marcus
Marcus

Reputation: 127

How to get text off a webpage?

I want to get text off of a webpage in C#.
I don't want to get the HTML, I want the real text off of the webpage. Like if I type "<b>cake</b>", I want the cake, not the tags.

Upvotes: 2

Views: 3520

Answers (3)

Reza ArabQaeni
Reza ArabQaeni

Reputation: 4908

Use the HTML Agility Pack library.

That's very fine library for parse HTML, for your requirement use this code:

    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("Yor Path(local,web)"); 
    var result=doc.DocumentNode.SelectNodes("//body//text()");//return HtmlCollectionNode
    foreach(var node in result)
    {
        string AchivedText=node.InnerText;//Your desire text
    }

Upvotes: 4

Ry-
Ry-

Reputation: 225281

You can strip tags using regular expressions such as this one2 (a simple example):

// You can import System.Text.RegularExpressions for convenience, of course.
System.Text.RegularExpressions.Regex tag = new System.Text.RegularExpressions.Regex("\<.+?\>");
myHTML = tag.Replace(myHTML, String.Empty);

But if you need to retrieve large volumes of well-structured data, then you might be better off using an HTML library1. (If the webpage is XHTML, all the better - use the System.Xml classes.)

1 Like http://htmlagilitypack.codeplex.com/, for example.
2 This might have unintended side-effects if you're trying to get data out of JavaScript, or if the data is inside an element's attribute and includes angle brackets. You'll also need to accept escape sequences like &amp;.

Upvotes: 1

kol
kol

Reputation: 28728

It depends. If your application downloads the webpage using a WebBrowser component, then that component will do the parsing for you automatically in the background (just like Internet Explorer). Just walk the DOM tree and extract the text you want. You will find HtmlElement.InnerText property especially useful :)

Upvotes: 1

Related Questions