How to always get the website title without downloading all the page source

Question

Yes, I will concur that at first glance, this looks exactly like a duplicate of the following:

Truth be told, this question is closely related to those two. However, I noticed a flaw in the code from just about every link I have found so far while researching this topic.

Here are some other links that are similar to the above links in content:

If it has to be known, I am getting the URL of the page using this particular method, as outlined in this link, but I presumed that it wouldn't matter:

Dragging URLs to Windows Forms controls in C#

The code from the first link works pretty well, albeit with one big issue:

If, for example, I take the URL from this site: http://www.dotnetperls.com/imagelist

And pass it to the code, which I have a modified version of below:

private static string GetWebPageTitle(string url)
{
    HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
    HttpWebResponse response = (request.GetResponse() as HttpWebResponse);
    using (Stream stream = response.GetResponseStream())
    {
        // compiled regex to check for <title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0)
        {
            // convert the byte-array to a string and add it to the rest of the
            // contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success)
            {
                // we found a <title> match =]
                return m.Groups[1].Value;
            }
            else if (contents.Contains("</head>"))
            {
                // reached end of head-block; no title found =[
                return null;
            }
        }
        return null;
    }
}

It returns a blank result, or null. However, when I inspect the HTML source of the page, the title tag is most definitely there.

Thus, my question is: how can the code be modified or corrected, starting from either my modified version above or the code in any of the other four links, so that it obtains the title from every web page that has a title tag? One example is the DotNetPerls page linked earlier in this question.

I am merely guessing, but I wonder if the website displays differently from other typical sites, like maybe it doesn't display any code when you load it the first time perhaps, but the browser actually reloads the site after loading it the first time transparently...

I would prefer an answer with some working example code, if possible.

Rob · Accepted Answer

It's not matching the title because the stream you are reading is the raw response stream, which in this case has been gzipped. (Add a Console.WriteLine(contents) inside the loop to see this.)
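The effect can be reproduced without touching the network. The following is only a sketch: the class name GzipTitleDemo, its helper methods, and the sample HTML are made up for illustration. It compresses a small page with GZipStream, then runs the title regex once against the compressed bytes decoded as UTF-8 (which is what the loop in the question ends up doing) and once against the decompressed text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class GzipTitleDemo
{
    private static readonly Regex TitleCheck = new Regex(
        @"<title>\s*(.+?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical sample page; the repeated filler makes it clearly compressible.
    public static string SampleHtml()
    {
        return "<html><head><title>Example</title></head><body>"
            + string.Concat(Enumerable.Repeat("<p>filler</p>", 50))
            + "</body></html>";
    }

    // Compresses text the way a server sending "Content-Encoding: gzip" would.
    public static byte[] Gzip(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the GZipStream has closed the MemoryStream.
            return output.ToArray();
        }
    }

    // Reverses Gzip(): decompresses the bytes and decodes them as UTF-8.
    public static string Gunzip(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static bool HasTitle(string contents)
    {
        return TitleCheck.IsMatch(contents);
    }

    public static void Main()
    {
        byte[] compressed = Gzip(SampleHtml());

        // Decoding the compressed bytes directly, as the question's loop does.
        Console.WriteLine(HasTitle(Encoding.UTF8.GetString(compressed)));

        // Decoding only after decompression.
        Console.WriteLine(HasTitle(Gunzip(compressed)));
    }
}
```

The first check should come back False and the second True: the title is in the response, but the regex never sees it because the text it is given is still compressed.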

To have the stream automatically decompressed, do this:

request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

(Solution for automatic decompression taken from here)
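Putting that line into the method from the question gives something like the sketch below. It is one possible arrangement, not the only one: the class name PageTitle and the split into ExtractTitle and GetWebPageTitle are my own choices, made so the parsing part can be tried against any Stream. Like the question's code, it assumes the page is UTF-8. It also allows attributes on the title tag and uses a Decoder so that a multi-byte character split across two reads is not corrupted.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class PageTitle
{
    private static readonly Regex TitleCheck = new Regex(
        @"<title[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads the stream in chunks and stops as soon as the title,
    // or the end of the head block, has been seen.
    public static string ExtractTitle(Stream stream)
    {
        var buffer = new byte[8192];
        var contents = new StringBuilder();

        // A Decoder keeps state between calls, so a UTF-8 character that is
        // split across two reads is still decoded correctly.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var chars = new char[Encoding.UTF8.GetMaxCharCount(buffer.Length)];

        int length;
        while ((length = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            int charCount = decoder.GetChars(buffer, 0, length, chars, 0);
            contents.Append(chars, 0, charCount);

            string text = contents.ToString();
            Match m = TitleCheck.Match(text);
            if (m.Success)
            {
                return m.Groups[1].Value;
            }
            if (text.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                // reached end of head-block; no title found
                return null;
            }
        }
        return null;
    }

    public static string GetWebPageTitle(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // The fix from this answer: let the framework undo gzip/deflate.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            return ExtractTitle(stream);
        }
    }
}
```

Calling PageTitle.GetWebPageTitle("http://www.dotnetperls.com/imagelist") would then read only as much of the decompressed page as is needed to find the title. Whether the server still responds the same way today is not something this sketch can guarantee.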
