Sergio Tapia

Reputation: 41128

Parsing Information out of a Scraped Screen (HTML)

I'm trying to have my program "rip" news off of a website and display it on a WinForm, but my method is so clumsy and redundant that I'm sure there must be a better way to do it.

public void LoadLatestNews()
{
    WebClient TheWebClient = new WebClient();
    string SourceCode = TheWebClient.DownloadString("http://www.chronic-domination.com/");
    int NewsPosition = SourceCode.IndexOf("news_post-title");

    string Y = SourceCode.Substring(NewsPosition,5000);
    int TitlePosition = Y.IndexOf("</div");

    string NewsPostTitle = SourceCode.Substring((NewsPosition + 17), (TitlePosition - 17));

    int BodyPosition = Y.IndexOf("news_post-body");

    string X = Y.Substring(BodyPosition, 1000);
    int EndBodyPosition = X.IndexOf("<br><br>");

    string NewsPostBody = X.Substring((BodyPosition + 16)+ EndBodyPosition);

    MessageBox.Show(NewsPostTitle);

}

Not only is this code horrible, it doesn't even work as intended. So I beg you: what is the proper way to do things like this?

Upvotes: 1

Views: 936

Answers (2)

Rex M

Reputation: 144112

Use the Html Agility Pack to parse the page. You can load the entire text of the page and then treat it as XML: write XPath expressions or walk the DOM tree to get what you need.

This lets you stop treating the page as raw text to be "scraped" and approach the task as you would any other XML store. Here's a very basic intro to XPath. Assuming myDoc is an HtmlDocument, you could write something like myDoc.DocumentNode.SelectSingleNode("//div[@class='header']/h2").InnerText, which means "select the H2 element that is an immediate child of the DIV whose class is 'header'" and then get the inner text of that element.
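Applied to the question, a minimal sketch might look like the following. It assumes that news_post-title and news_post-body are class attributes on div elements (the question's code only shows the strings, so this is a guess), and it uses a hypothetical inline HTML string in place of the downloaded page. Html Agility Pack is a third-party NuGet package, and the helper name TextOf is made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

static class NewsScraper
{
    // Returns the trimmed, entity-decoded text of the first node matching
    // the XPath expression, or null when nothing matches.
    static string TextOf(HtmlDocument doc, string xpath)
    {
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    static void Main()
    {
        // Hypothetical markup standing in for the result of DownloadString.
        // The real page structure is an assumption.
        string html =
            "<html><body>" +
            "<div class='news_post-title'>Server maintenance</div>" +
            "<div class='news_post-body'>Back online at noon.</div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string title = TextOf(doc, "//div[@class='news_post-title']");
        string body = TextOf(doc, "//div[@class='news_post-body']");

        Console.WriteLine(title ?? "(title not found)");
        Console.WriteLine(body ?? "(body not found)");
    }
}
```

The null check matters because SelectSingleNode returns null when the expression matches nothing, which is what happens if the site changes its markup. In the question's LoadLatestNews method, the html string would come from TheWebClient.DownloadString instead.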

Upvotes: 4

Nick

Reputation: 1815

Have a look at Wikipedia's entry on web scraping. I do a lot of web scraping, and in my experience regular expressions are sufficient about 80% of the time. For the rest, you need to parse the (X)HTML and traverse the DOM tree.
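As a rough sketch of the regular expression route, the snippet below pulls the title out of a hypothetical fragment of markup. The div and class names are assumptions based on the strings in the question, not the real page, and the pattern is deliberately simple.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexScrape
{
    static void Main()
    {
        // Hypothetical markup; the real page structure is an assumption.
        string html =
            "<div class=\"news_post-title\">Server maintenance</div>" +
            "<div class=\"news_post-body\">Back online at noon.</div>";

        // Capture everything between the opening tag and the first closing </div>.
        // The lazy quantifier stops at the first </div>, so nested divs break it.
        Match m = Regex.Match(
            html,
            "<div[^>]*class=\"news_post-title\"[^>]*>(.*?)</div>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        Console.WriteLine(m.Success ? m.Groups[1].Value.Trim() : "(title not found)");
    }
}
```

This works while the target element is flat and the attribute quoting is stable. Once the markup nests or varies, that is the point at which a real parser becomes the safer choice.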

Upvotes: 1
