Reputation: 836
I have the following C# code, which gets the contents of a web page and stores them in a string variable:
using System.IO;
using System.Net;

string html;
WebRequest request = WebRequest.Create("http://www.arsenal.com");
// Dispose the response and the reader once the body has been read.
using (WebResponse response = request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    html = sr.ReadToEnd();
}
The code works properly, but I need to store the content of the page without the HTML tags and JavaScript. Is there any way to do so (a built-in method or something ready-made for this)?
I have found some ways of removing HTML tags, but JavaScript and CSS styles still get in the way. I should also mention that my approach to removing the HTML tags, which uses regular expressions, is not working well either.
Upvotes: 0
Views: 330
Reputation: 6409
As this question suggests, parsing HTML is a tricky process and the best approach is to use a library.
I've used the HTML Agility Pack before with some success, though this question lists some other options.
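For illustration, here is a minimal sketch of how that might look with the HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HtmlTextExtractor, ExtractText) are made up for this example and are not part of the library. The idea is to drop the script, style, and comment nodes first and then read whatever text is left:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper for illustration; not part of the library.
public static class HtmlTextExtractor
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            foreach (HtmlNode node in unwanted)
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes; DeEntitize turns
        // entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

Passing the html string from the question to HtmlTextExtractor.ExtractText(html) should give you the page text without markup, scripts, or styles. You will probably still want to collapse the leftover whitespace and blank lines yourself.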
Upvotes: 2