Mike_OBrien

Reputation: 1423

WebClient.DownloadString getting gibberish

I recently wrote a web crawler as a side project. It uses System.Net.WebClient's DownloadString function to download the HTML at a given address, does some string manipulation to pull out all of the links in that HTML, and then repeats the process on every link it finds (skipping anything it has already crawled).

It works fine for most addresses, but when I start it with www.yahoo.com as the seed it does something very odd: instead of returning the HTML markup, the DownloadString call returns a bunch of gibberish.

My understanding of DownloadString is that it basically gives back what you see when you view the page source in a browser, but that can't be the case here, because when I view the source of www.yahoo.com in a browser I see the HTML as expected.

Looking at it very briefly, my initial thought was that the string was decoded with a different encoding than the one used to encode it, but I don't see a way to manually set which encoding to use when downloading the string via the System.Net.WebClient class.

This is a portion of the text that I receive:

‹Ä½y“£FÖ7úÿó)4í™ûQ«Ä.è;^´ïû~ûvH€Ö ÷›€ÈL©ªì‰{­» gÉ“'OîÉ¿ÿQî•Æ‹~%cûÿùwøŸŒ¥þþEÜ¥|ÉØ’cüþEsr“Ñ—ŒbK¾KËlδâÚûãg윻2}×Ïy€S°õ3úü/w 2žB†©š.íí ³±+·7s®“9XÚQórže˜AƼŒªùëÀÝfÊ×ÿÊë€" µdÙ¾¤k_2~p¶µß¿È

So my first question is: am I doing something wrong when pulling the HTML from www.yahoo.com, and if so, is there another way I should be pulling it? My second question is: if this is by design, how are they accomplishing it, and why would they scramble the response? Are they trying to keep competitors from crawling their website?

Upvotes: 2

Views: 206

Answers (1)

Matt Wilko

Reputation: 27342

It seems that Yahoo is particular about the user agent. You can set one explicitly to get the expected plain-text response:

    ' Requires: Imports System.Net
    Using client As New WebClient()
        ' Send a browser-like User-Agent header with the request.
        client.Headers(HttpRequestHeader.UserAgent) = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727)"
        Dim url As String = "http://www.yahoo.com"
        Dim webPage As String = client.DownloadString(url)
        Debug.WriteLine(webPage)
    End Using
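
If setting the user agent alone does not help, one other thing that may be worth checking (this is a guess, not something confirmed in this thread) is whether the server is returning a compressed body. The `‹` at the start of the sample in the question is how the byte 0x8B, the second byte of a gzip header, renders as Windows-1252 text. Below is a minimal sketch under that assumption. `DecompressingWebClient` is a hypothetical name of my own, and the UTF-8 default is also an assumption. It also sets the `WebClient.Encoding` property, which controls how the downloaded bytes are decoded when the response does not declare a charset.

```vbnet
Imports System
Imports System.Net

' Hypothetical helper class (not part of the framework): a WebClient that
' asks the underlying HttpWebRequest to decompress gzip/deflate responses.
Public Class DecompressingWebClient
    Inherits WebClient

    Public Sub New()
        ' Assumption: fall back to UTF-8 when the response declares no charset.
        Me.Encoding = System.Text.Encoding.UTF8
    End Sub

    Protected Overrides Function GetWebRequest(address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        Dim httpRequest As HttpWebRequest = TryCast(request, HttpWebRequest)
        If httpRequest IsNot Nothing Then
            ' Transparently decompress gzip- or deflate-encoded bodies.
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip Or DecompressionMethods.Deflate
        End If
        Return request
    End Function
End Class

Module Program
    Sub Main()
        Using client As New DecompressingWebClient()
            client.Headers(HttpRequestHeader.UserAgent) =
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727)"
            Dim webPage As String = client.DownloadString("http://www.yahoo.com")
            ' Print only the start of the page to check that it is readable HTML.
            Console.WriteLine(webPage.Substring(0, Math.Min(200, webPage.Length)))
        End Using
    End Sub
End Module
```

Overriding `GetWebRequest` keeps the rest of the crawler unchanged, since the subclass can be used anywhere a plain `WebClient` was used before.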

Upvotes: 2
