Parsing Information out of a Scraped Screen (HTML)

Question

I'm trying to have my program "rip" news off of a website and place it on a WinForm, but my method is so dumb and redundant that I'm sure there must be a better way to do it.

public void LoadLatestNews()
{
    WebClient TheWebClient = new WebClient();
    string SourceCode = TheWebClient.DownloadString("http://www.chronic-domination.com/");
    int NewsPosition = SourceCode.IndexOf("news_post-title");

    string Y = SourceCode.Substring(NewsPosition,5000);
    int TitlePosition = Y.IndexOf("
");

    string NewsPostBody = X.Substring((BodyPosition + 16)+ EndBodyPosition);

    MessageBox.Show(NewsPostTitle);

}

Not only is this code horrible, it doesn't even work as intended. So I beg you: what is the proper way to do things like this?

Rex M · Accepted Answer

Use the Html Agility Pack to parse the page. You can load the entire text of the page and then treat it as XML: write XPath expressions or crawl the DOM tree to get what you need.
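
As a rough illustration of the "crawl the DOM tree" option, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name DomCrawlSketch, the method HeadingTexts, and the choice of h2 elements are made up for the example and are not taken from the site in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class DomCrawlSketch
{
    // Hypothetical helper: returns the trimmed text of every <h2> in the markup.
    public static List<string> HeadingTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of real-world, non-well-formed HTML

        return doc.DocumentNode
                  .Descendants("h2")              // walk the whole tree
                  .Select(n => n.InnerText.Trim())
                  .ToList();
    }
}
```

The html argument can be the string you already get from WebClient.DownloadString; alternatively, new HtmlWeb().Load(url) downloads and parses the page in one step.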

This allows you to avoid the problem of "scraping" at all and approach the task as you would any other XML store. Here's a very basic intro to XPath. You could write something like myDoc.DocumentNode.SelectSingleNode("//div[@class='header']/h2").InnerText, which means "select the H2 element that is an immediate child of the DIV whose class is 'header'", and then get the inner text of that element.
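
Applied to the method in the question, the XPath approach might look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced. It also assumes that "news_post-title" is a class name on the element holding the headline, which is only a guess from the IndexOf call in the question; check the page source and adjust the XPath. The names XPathSketch, FirstText and LoadLatestNewsTitle are hypothetical.

```csharp
using System.Net;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

public static class XPathSketch
{
    // Hypothetical helper: text of the first node matching the XPath,
    // or null when nothing matches (SelectSingleNode returns null then).
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    public static string LoadLatestNewsTitle()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.chronic-domination.com/");

            // Assumption: the headline element carries the class "news_post-title".
            return FirstText(html, "//*[contains(@class, 'news_post-title')]");
        }
    }
}
```

Checking for null before reading InnerText matters here: if the site changes its markup, you get a clear "not found" result instead of a NullReferenceException, and none of the Substring offsets from the original method are needed.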
