mary

Reputation: 267

How to extract meta tags from a series of URLs without downloading the whole HTML in C#

I want to extract the title, description and keywords of a series of URLs.
I have this code:

    WebClient x = new WebClient();
    string pageSource = x.DownloadString(url);
    query.title = Regex.Match(pageSource, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

But I do not want to download the whole page, because that is very time-consuming for a series of URLs. Is there any way to get this information without downloading the whole page?
I should mention that I get these URLs from a Google search results page by sending a query to Google.

Upvotes: 0

Views: 2350

Answers (1)

Aleksey L.

Reputation: 37938

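You can request a partial response with HttpClient by setting the Range header, and then read only as many bytes as you need. Pick a buffer length large enough to cover the part of the page that holds the tags you want. Note that a server may ignore the Range header and send the full body anyway; reading from the response stream with HttpCompletionOption.ResponseHeadersRead still lets you stop after the first bytes: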

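    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    static void Main()
    {
        Test().GetAwaiter().GetResult();
    }

    private static async Task Test()
    {
        const string url = "http://google.com";
        const int bytesToRead = 2000;

        using (var httpclient = new HttpClient())
        {
            // Range bounds are inclusive: 0..bytesToRead-1 asks for exactly bytesToRead bytes.
            httpclient.DefaultRequestHeaders.Range = new RangeHeaderValue(0, bytesToRead - 1);

            // Return as soon as the headers arrive instead of buffering the whole body.
            using (var response = await httpclient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            using (var stream = await response.Content.ReadAsStreamAsync())
            {
                var buffer = new byte[bytesToRead];
                var total = 0;
                int read;
                // A single read may return fewer bytes than requested,
                // so loop until the buffer is full or the stream ends.
                while (total < buffer.Length &&
                       (read = await stream.ReadAsync(buffer, total, buffer.Length - total)) > 0)
                {
                    total += read;
                }

                // Decode only the bytes that were actually read.
                var partialHtml = Encoding.UTF8.GetString(buffer, 0, total);
                //extract required info from partial html
            }
        }
    }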
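
For the "extract required info" step, a minimal sketch could reuse the title regex from the question and add a similar one for the description and keywords meta tags. The class name MetaExtractor and its two methods are hypothetical helpers, not part of any library. The meta pattern assumes the name attribute comes before content and that both values are quoted. If the partial download cuts a tag in half, the methods return an empty string.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for pulling a few tags out of a (possibly truncated) HTML prefix.
public static class MetaExtractor
{
    // Returns the text inside <title>...</title>, or "" if the tag is missing or cut off.
    public static string GetTitle(string html)
    {
        var match = Regex.Match(
            html,
            @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>",
            RegexOptions.IgnoreCase);
        return WebUtility.HtmlDecode(match.Groups["Title"].Value.Trim());
    }

    // Returns the content attribute of <meta name="..."> for the given name, or "".
    // Assumption: name appears before content and both attribute values are quoted.
    public static string GetMeta(string html, string name)
    {
        var pattern =
            @"<meta\b[^>]*\bname\s*=\s*[""']" + Regex.Escape(name) +
            @"[""'][^>]*\bcontent\s*=\s*([""'])(?<Content>.*?)\1";
        var match = Regex.Match(
            html,
            pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return WebUtility.HtmlDecode(match.Groups["Content"].Value.Trim());
    }
}
```

With the partialHtml string from the code above, usage would look like `query.title = MetaExtractor.GetTitle(partialHtml);` and `MetaExtractor.GetMeta(partialHtml, "description")`. Regexes are fragile on real-world HTML, so an HTML parser may be a better fit if the pages vary a lot.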

The same partial download could also be achieved using the "old" WebClient.
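
One way to approximate this with WebClient is sketched below. It is a sketch under assumptions, not the exact variant the answer had in mind: it does not send a Range header at all, it just opens the response stream with WebClient.OpenRead and stops reading after a fixed number of bytes. The class PartialDownload and its methods ReadPrefix and DownloadPrefix are hypothetical names. How much data the runtime still transfers after the stream is closed is outside the sketch's control.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper: read only the beginning of a response body.
public static class PartialDownload
{
    // Reads at most maxBytes from the stream and decodes them as UTF-8.
    // A multi-byte character split at the cut-off point decodes to a replacement character.
    public static string ReadPrefix(Stream stream, int maxBytes)
    {
        var buffer = new byte[maxBytes];
        var total = 0;
        int read;
        // Read may return fewer bytes than requested, so loop until full or end of stream.
        while (total < maxBytes &&
               (read = stream.Read(buffer, total, maxBytes - total)) > 0)
        {
            total += read;
        }
        return Encoding.UTF8.GetString(buffer, 0, total);
    }

    // Opens the URL with WebClient and returns only the first maxBytes of the body.
    public static string DownloadPrefix(string url, int maxBytes)
    {
        using (var client = new WebClient())
        using (var stream = client.OpenRead(url))
        {
            return ReadPrefix(stream, maxBytes);
        }
    }
}
```

A call such as `PartialDownload.DownloadPrefix(url, 2000)` would then stand in for the `DownloadString(url)` call in the question.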

Upvotes: 3
