Reputation: 1477
I'm making a Web Crawler and I just found out that one of my methods, GetHTML, is very slow because it uses a StreamReader to get a string of the HTML out of the HttpWebResponse object.
Here is the method:
static string GetHTML(string URL)
{
HttpWebRequest Request = (HttpWebRequest)WebRequest.Create(URL);
Request.Proxy = null;
HttpWebResponse Response = ((HttpWebResponse)Request.GetResponse());
Stream RespStream = Response.GetResponseStream();
return new StreamReader(RespStream).ReadToEnd(); // Very slow
}
I made a test with Stopwatch and used this method on YouTube.
Time it takes to get an HTTP response: 500 MS
Time it takes to convert the HttpWebResponse object to a string: 550 MS
So the HTTP request is fine, it's just the ReadToEnd() that is so slow.
Is there any alternative to the ReadToEnd() method to get an HTML string from the response object? I tried using WebClient.DownloadString() method, but it's just a wrapper around HttpWebRequest that uses streams too.
EDIT: Tried it with Sockets and it's much faster:
static string SocketHTML(string URL)
{
string IP = Dns.GetHostAddresses(URL)[0].ToString();
Socket s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
s.Connect(new IPEndPoint(IPAddress.Parse(IP), 80));
s.Send(Encoding.ASCII.GetBytes("GET / HTTP/1.1\r\n\r\n"));
List<byte> HTML = new List<byte>();
int Bytes = 1;
while (Bytes > 0)
{
byte[] Data = new byte[1024];
Bytes = s.Receive(Data);
foreach (byte b in Data) HTML.Add(b);
}
s.Close();
return Encoding.ASCII.GetString(HTML.ToArray());
}
The problem with the Sockets version, though, is that most of the time it returns errors such as "Moved Permanently" or "Your browser sent a request that the server could not understand".
Upvotes: 0
Views: 7052
Reputation: 726479
I made this comparison to see if the StreamReader.ReadToEnd() is the bottleneck, and I've seen it is.
You jumped to the wrong conclusion here: the bottleneck is the whole method, not just its StreamReader.ReadToEnd() portion.
When I receive the response and I don't use the ReadToEnd() method, it takes about 500 MS, but if I use the ReadToEnd() method it takes 1000 MS.
That's the thing: being able to call Response.GetResponseStream() does not mean that you "got a response". All you get is confirmation that the response is there.
In the real world, this is like receiving a parcel that you must sign for at the post office. The post office puts a card in your mailbox saying that a delivery is waiting for you. That's your Response.GetResponseStream() call. But at this point you do not have your parcel, only a card that says the parcel is there. Now you need to go to the post office, show them the card, and retrieve the parcel. That's the StreamReader.ReadToEnd() call.
The time nearly doubles because most of the 1000 ms is spent communicating with the remote server. If you need the entire response, there is little you can do to speed this up. The good news is that since the time is spent in I/O, there is a good chance that you can parallelize this code to retrieve data from multiple web sites at once (assuming that you do not load your network to capacity).
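As a rough sketch of that parallelization idea: the version below swaps HttpWebRequest for HttpClient and Task.WhenAll, which is my substitution and not something the question uses. The class name ParallelFetch, the helper FetchAllAsync, and the two URLs are placeholders chosen for illustration. The fetch function is passed in as a delegate so the overlap logic can be tried without touching the network.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

static class ParallelFetch
{
    // A single shared client; HttpClient is designed to be reused.
    static readonly HttpClient Client = new HttpClient();

    // Starts every download before awaiting any of them, so the time
    // spent waiting on each server overlaps instead of adding up.
    // Results come back in the same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch)
    {
        Task<string>[] pending = urls.Select(fetch).ToArray();
        return await Task.WhenAll(pending);
    }

    static async Task Main()
    {
        // Placeholder URLs (assumption); substitute the pages to crawl.
        string[] urls = { "https://example.com/", "https://example.org/" };

        string[] pages = await FetchAllAsync(urls, url => Client.GetStringAsync(url));

        for (int i = 0; i < urls.Length; i++)
            Console.WriteLine($"{urls[i]}: {pages[i].Length} chars");
    }
}
```

Each individual download still takes as long as the server and network need. The total wall-clock time for the batch is what should shrink, as long as the connection is not already saturated.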
Upvotes: 2
Reputation: 1499770
When I call this method but return String.Empty instead of the ReadToEnd, the method takes about 500 MS.
All that says is that starting to get the response takes 500ms. Calling GetResponseStream doesn't consume all the data.
ReadToEnd will also be doing conversion from the binary data to text, but I doubt that's significant - I strongly suspect it's just waiting for the data to arrive over the network. To verify that, you should add logging to every aspect of your code and run Wireshark - you should then be able to see packet-by-packet when the data arrives, and correlate it with the logging.
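One way to get that kind of logging, as a minimal sketch: read the response stream in raw chunks and print a timestamp for each one, so the output can be lined up against a Wireshark capture. The class name ChunkTiming, the helper ReadWithLog, the 8192-byte buffer, and the URL are all assumptions made for this example.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;

static class ChunkTiming
{
    // Reads the stream to the end in raw chunks and logs the elapsed
    // time at which each chunk became available. Returns total bytes read.
    public static long ReadWithLog(Stream stream, Stopwatch watch, TextWriter log)
    {
        var buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            total += read;
            log.WriteLine($"{watch.ElapsedMilliseconds,6} ms: +{read} bytes (total {total})");
        }
        return total;
    }

    static void Main()
    {
        // Placeholder URL (assumption); substitute the page being crawled.
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/");
        request.Proxy = null;

        var watch = Stopwatch.StartNew();
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            // GetResponse returns once the status line and headers are in.
            Console.WriteLine($"{watch.ElapsedMilliseconds,6} ms: headers received");
            long total = ReadWithLog(body, watch, Console.Out);
            Console.WriteLine($"{watch.ElapsedMilliseconds,6} ms: body complete, {total} bytes");
        }
    }
}
```

If the timestamps are spread across the whole read, the time is going to the network and not to text decoding.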
As a side issue, you should definitely have a using statement for the response:
using (var response = (HttpWebResponse)Request.GetResponse())
{
    // The stream will be disposed when the response is.
    return new StreamReader(response.GetResponseStream()).ReadToEnd();
}
If you don't dispose of the response, you'll tie up connections until the garbage collector finalizes them. That can lead to timeouts.
Upvotes: 5
Reputation: 700152
It's not the ReadToEnd method that is slow, it's waiting for the data that takes time.
The ReadToEnd method is fast enough. I just tested reading a megabyte of data from a memory stream using a stream reader, and it takes only 3 ms.
When you get the response stream from the request, it has only started to get the data that was requested. Once you have read the data already received, it has to wait for the rest of the data to arrive. That's what's taking time in the ReadToEnd call. Using any other means of reading the stream won't make it faster.
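The difference can be reproduced without a network. The sketch below times ReadToEnd on a plain MemoryStream and then on the same data behind a wrapper that sleeps before every read. SlowStream, ReadToEndDemo, the 64 KB payload, and the 5 ms delay are all made up for this illustration; the sleep only stands in for network latency and does not model real TCP behavior.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Threading;

// Hypothetical helper: a read-only wrapper that pauses before each
// read, imitating data that trickles in from a remote server.
sealed class SlowStream : Stream
{
    private readonly Stream inner;
    private readonly int delayMs;

    public SlowStream(Stream inner, int delayMs)
    {
        this.inner = inner;
        this.delayMs = delayMs;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        Thread.Sleep(delayMs); // simulated wait for data to arrive
        return inner.Read(buffer, offset, count);
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

static class ReadToEndDemo
{
    // Returns the milliseconds ReadToEnd needs to drain the given stream.
    public static long TimeReadToEnd(Stream source)
    {
        var watch = Stopwatch.StartNew();
        using (var reader = new StreamReader(source, Encoding.ASCII))
        {
            reader.ReadToEnd();
        }
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes(new string('x', 64 * 1024));

        long inMemory = TimeReadToEnd(new MemoryStream(data));
        long delayed = TimeReadToEnd(new SlowStream(new MemoryStream(data), 5));

        Console.WriteLine($"MemoryStream: {inMemory} ms");
        Console.WriteLine($"SlowStream:   {delayed} ms");
    }
}
```

The same ReadToEnd call runs in both cases; only the time the underlying stream takes to hand over bytes changes.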
Upvotes: 1