Pratik

Reputation: 11745

Read only the title and/or META tag of HTML file, without loading complete HTML file

Scenario:

I need to parse millions of HTML files/pages (as fast as I can), read only the title or META part of each, and dump it to a database.

Currently I use the System.Net.WebClient class's DownloadString(url_path) to download each page, and then save the result to the database with LINQ to SQL.

But DownloadString gives me the complete HTML source; I only need the title and the META tags.

Any ideas on how to download only that much content?

Upvotes: 2

Views: 2592

Answers (3)

Samir Adel

Reputation: 2499

I think you can open a stream on the URL and read only the first x characters from it. I can't tell you the exact number, but you can set it to a reasonable value that is large enough to cover the title and the description.

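// Requires: using System.IO; using System.Net;
const int MaxChars = 4096; // assumed budget; tune it for the pages you crawl

HttpWebRequest fileToDownload = (HttpWebRequest)WebRequest.Create("YourURL");
using (WebResponse fileDownloadResponse = fileToDownload.GetResponse())
using (Stream fileStream = fileDownloadResponse.GetResponseStream())
using (StreamReader fileStreamReader = new StreamReader(fileStream))
{
    char[] buffer = new char[MaxChars];
    // ReadBlock keeps reading until the buffer is full or the stream ends;
    // a single Read call may return fewer characters than requested.
    int charsRead = fileStreamReader.ReadBlock(buffer, 0, buffer.Length);
    // Build the string from only the characters actually read,
    // so a short page does not leave trailing '\0' characters in it.
    string data = new string(buffer, 0, charsRead);
    // data now holds the start of the page; parse the title and meta tags from it.
}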
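Instead of guessing a fixed number, the same idea can stop as soon as the closing `</head>` tag shows up. The following is only a sketch under assumptions: `HeadReader`, `ReadHead`, `ExtractTitle`, and the 16384-character cap are hypothetical names and values, and the regex is a rough shortcut rather than a real HTML parser.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class HeadReader
{
    private const string HeadEnd = "</head>";

    // Reads chunks from the reader until "</head>" has been seen or
    // maxChars characters have been read, whichever comes first.
    public static string ReadHead(TextReader reader, int maxChars = 16384)
    {
        var sb = new StringBuilder();
        var buffer = new char[1024];
        while (sb.Length < maxChars)
        {
            int want = Math.Min(buffer.Length, maxChars - sb.Length);
            int read = reader.Read(buffer, 0, want);
            if (read <= 0) break; // end of stream

            // Start the search slightly before the new chunk so that a tag
            // split across two chunks is still found.
            int searchFrom = Math.Max(0, sb.Length - HeadEnd.Length);
            sb.Append(buffer, 0, read);

            int end = sb.ToString().IndexOf(HeadEnd, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (end >= 0) return sb.ToString(0, end + HeadEnd.Length);
        }
        return sb.ToString(); // no </head> found within the cap
    }

    // Returns the trimmed text of the first <title> element, or null if there is none.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }
}
```

Passing the `StreamReader` from the snippet above to `ReadHead` would stop reading once the head section is complete. Meta tags could be pulled out of the returned string in a similar way, though an HTML parser is likely more robust than a regex for attributes that can appear in any order.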

Upvotes: 2

Zachary

Reputation: 6532

You can use the verb "HEAD" in an HttpWebRequest to return only the response headers (not the <head> element). To get the full <head> element with the metadata, you'll need to download the page and parse out the metadata you want.

var request = System.Net.WebRequest.Create(uri);
request.Method = "HEAD";

Upvotes: 0

GeoffM

Reputation: 1611

I suspect that WebClient will try to download the whole page first, in which case you'd probably want a raw client socket. Send the appropriate HTTP request (manually, since you're using raw sockets), start reading the response (which will not arrive immediately) and kill the connection when you've read enough. However, the rest will probably already have been sent by the server and be winging its way to your PC whether you want it or not, so you might not save much - if anything - of the bandwidth.
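A rough sketch of that raw-socket approach might look like the following. It makes several assumptions: `RawHeadFetcher`, `BuildRequest`, and `FetchUntilHeadEnd` are hypothetical names, the server is reached over plain HTTP on port 80, and there is no handling of TLS, redirects, or chunked transfer encoding. The returned text also still contains the HTTP response headers.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class RawHeadFetcher
{
    // Builds a minimal HTTP/1.1 GET request by hand.
    // "Connection: close" asks the server not to keep the connection alive.
    public static string BuildRequest(string host, string path)
    {
        return "GET " + path + " HTTP/1.1\r\n" +
               "Host: " + host + "\r\n" +
               "Connection: close\r\n\r\n";
    }

    // Sends the request and reads until "</head>" is seen or maxBytes is reached.
    public static string FetchUntilHeadEnd(string host, string path, int maxBytes = 16384)
    {
        using (var client = new TcpClient(host, 80))
        using (NetworkStream stream = client.GetStream())
        {
            byte[] request = Encoding.ASCII.GetBytes(BuildRequest(host, path));
            stream.Write(request, 0, request.Length);

            var received = new MemoryStream();
            var buffer = new byte[1024];
            while (received.Length < maxBytes)
            {
                int read = stream.Read(buffer, 0, buffer.Length);
                if (read <= 0) break; // server closed the connection
                received.Write(buffer, 0, read);

                // ASCII decoding is enough to spot the tag; other bytes become '?'.
                string soFar = Encoding.ASCII.GetString(received.ToArray());
                if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0) break;
            }
            // Leaving the using blocks closes the socket, dropping the rest of the response.
            // UTF-8 is an assumption here; the real charset may differ per page.
            return Encoding.UTF8.GetString(received.ToArray());
        }
    }
}
```

As noted above, closing the socket early does not recall data the server has already sent, so any saving depends on page size and network buffering.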

Depending on what you want it for, many half-decent websites have a custom 404 page which is a lot simpler than a known page. Whether that has the information you're after is another matter.

Upvotes: 0
