Patryk

Reputation: 3152

How to get the text content of a web page?

I've wasted 2 days finding out that there's a known memory leak in the WebBrowser control (since 2007 or so, and they still haven't fixed it), so I've decided to simply ask here how to do what I need.

Until now (using WebBrowser), I've been visiting a site, selecting everything (Ctrl+A), and pasting it into a string, and that was all. I had the text content of the web page in my string. It worked perfectly until I found out that it takes 1 GB of memory after some time. Is it possible to do that through HttpWebRequest, httpwebclient, or anything else?

Thanks for any replies. There wasn't a thread like this (or I haven't found one; I didn't spend much time searching because I'm really pissed off now :P).

FORGOT TO ADD: I don't want the HTML code; I know it's easy to get. In my case, the HTML code is useless. I need the text the user sees when opening the page in a web browser.

Upvotes: 0

Views: 1965

Answers (4)

Internet Engineer

Reputation: 2534

Why don't you use a free, open-source HTML scraper like NCrawler?

It is written in C#.

ncrawler.codeplex.com

You can get examples on how to use it here.

Upvotes: 1

woz

Reputation: 10994

You can use this:

// Requires: using System.IO; using System.Net;
string getHtml(string url) {
   HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
   request.Method = "GET";
   // Dispose the response and the reader even if reading throws.
   using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
   using (StreamReader source = new StreamReader(response.GetResponseStream())) {
      return source.ReadToEnd();
   }
}

You still have to do some substring replacement to reduce it from HTML to text. It's not too bad if you just want the text from a certain div.
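As a rough illustration of that HTML-to-text step, here is a minimal sketch. The class name HtmlText and the method ToPlainText are made up for this example and are not part of any library. It assumes reasonably well-formed markup and only approximates what a browser displays; a real HTML parser (for example Html Agility Pack) would likely handle messy pages more reliably.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the .NET framework.
static class HtmlText
{
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> elements together with their contents,
        // since none of that is visible text.
        string text = Regex.Replace(html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words stay apart.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With the getHtml method above, usage would look something like string text = HtmlText.ToPlainText(getHtml(url));. Regular expressions are a blunt tool for HTML, so treat this as a starting point, not a complete solution.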

Upvotes: 2

kay.one

Reputation: 7692

This will download the HTML content of any web page.

WebClient client = new WebClient ();
string reply = client.DownloadString ("http://www.google.com");

Upvotes: 2

L.B

Reputation: 116108

using (WebClient client = new WebClient())
{
    string html = client.DownloadString("http://stackoverflow.com/questions/10839877/how-to-get-a-txt-content-of-a-web-page");
}

Upvotes: 7
