Reputation: 2008
I'm writing a web crawler for a specific site. The application is a VB.Net Windows Forms application that does not use multiple threads; each web request is made consecutively. However, after ten successful page retrievals, every subsequent request times out.
I have reviewed the similar questions already posted here on SO, and have implemented the recommended techniques into my GetPage routine, shown below:
' Requires Imports System.IO and Imports System.Net
Public Function GetPage(ByVal url As String) As String
    Dim result As String = String.Empty
    Dim uri As New Uri(url)
    Dim sp As ServicePoint = ServicePointManager.FindServicePoint(uri)
    sp.ConnectionLimit = 100
    Dim request As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
    request.KeepAlive = False
    request.Timeout = 15000
    Try
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using dataStream As Stream = response.GetResponseStream()
                Using reader As New StreamReader(dataStream)
                    If response.StatusCode <> HttpStatusCode.OK Then
                        Throw New Exception("Got response status code: " & response.StatusCode.ToString())
                    End If
                    result = reader.ReadToEnd()
                End Using
            End Using
            response.Close() ' Redundant: the Using block already disposes the response
        End Using
    Catch ex As Exception
        Dim msg As String = "Error reading page """ & url & """. " & ex.Message
        Logger.LogMessage(msg, LogOutputLevel.Diagnostics)
    End Try
    Return result
End Function
Have I missed something? Am I failing to close or dispose of an object that I should be? It seems strange that it always happens after ten consecutive requests.
Notes:
In the constructor for the class in which this method resides I have the following:
ServicePointManager.DefaultConnectionLimit = 100
If I set KeepAlive to true, the timeouts begin after five requests.
All the requests are for pages in the same domain.
EDIT
I added a delay of between two and seven seconds between each web request so that I do not appear to be "hammering" the site or attempting a DoS attack. However, the problem still occurs.
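It is implemented roughly like this (a minimal sketch; the field and method names are illustrative, not the actual code):

' A single shared Random instance; creating a new one per call can repeat values.
Private ReadOnly _rand As New Random()

' Pause between two and seven seconds before the next request.
Private Sub DelayBetweenRequests()
    System.Threading.Thread.Sleep(_rand.Next(2000, 7001))
End Sub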
Upvotes: 7
Views: 8199
Reputation: 11
If the server uses a database and does not close each database connection properly, you may receive an error (e.g. status code 502) once the maximum connection limit is reached, until the database connections time out. One workaround in that case is simply to sleep the web-request thread for a given time. Furthermore, you should ensure that each request and response stream is closed after processing, ideally with a Using statement.
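A minimal sketch of that pattern (the code below is illustrative, not from the original answer; the helper name and pause length are assumptions):

Imports System.IO
Imports System.Net
Imports System.Threading

Public Class PageFetcher
    ' Fetch a page, guaranteeing the response and its stream are disposed,
    ' then pause so the server can release its own resources (e.g. database
    ' connections) before the next request.
    Public Function GetPageWithPause(ByVal url As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.KeepAlive = False

        Dim page As String
        ' Using blocks close the response and stream even if ReadToEnd throws.
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream())
                page = reader.ReadToEnd()
            End Using
        End Using

        ' Sleep the thread between requests; three seconds is an arbitrary example.
        Thread.Sleep(3000)
        Return page
    End Function
End Class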
Upvotes: 1
Reputation: 3243
I know this is an old question, but I recently had this problem myself (my target environment was .NET 4.0 and did not allow any external assembly references). I did some digging, however, and found a fix of sorts that is very interesting from a .NET inner-workings perspective.
ServicePointManager.DefaultConnectionLimit = 100;
ServicePointManager internally manages the actual HTTP connections used by multiple HttpWebRequest objects. The problem is that these do not get closed automatically, and an HttpWebRequest does not get garbage collected immediately.
So I found something very interesting: if I make the HttpWebRequest an instance-level variable AND force garbage collection after swapping the reference out, it works (without the DefaultConnectionLimit = 100 hack):
private HttpWebRequest Request { get; set; }

public void MyMethod()
{
    // Swap out the instance-level reference, then force a collection so
    // the previous request's underlying connection is released.
    Request = (HttpWebRequest)HttpWebRequest.Create("http://myUrl");
    GC.Collect();
    GC.WaitForFullGCComplete();
}
Before this, I was creating a new local variable each time in the method. This seemed to fix my problem. It's probably a little too late to help you, but I thought I'd share in case anyone else comes across this.
Upvotes: 1
Reputation: 21
I used the following solution and it works for me. I hope it helps you too.
Declare the variables at form level ("globals"):
HttpWebRequest myHttpWebRequest;
HttpWebResponse myHttpWebResponse;
Then always call myHttpWebResponse.Close(); after each connection:
myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
myHttpWebResponse.Close();
Upvotes: 2
Reputation: 35135
Setting myRequest.KeepAlive = false; makes the request send a Connection: Close header, which will make the server close the connection, and the connection manager will then close it too. (Note that assigning myRequest.Connection = "Close"; directly throws an ArgumentException; the KeepAlive property has to be used to control this header.)
Upvotes: 0
Reputation: 3769
I ran into this issue today and my resolution was to ensure that the response was closed at all times.
I think you need to put in a response.Close() before you throw your exception inside the Using block:
Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
    Using dataStream As Stream = response.GetResponseStream()
        Using reader As New StreamReader(dataStream)
            If response.StatusCode <> HttpStatusCode.OK Then
                response.Close()
                Throw New Exception("Got response status code: " & response.StatusCode.ToString())
            End If
            result = reader.ReadToEnd()
        End Using
    End Using
    response.Close()
End Using
Upvotes: 4
Reputation: 7559
I think the site has some sort of DoS protection, which kicks in when it is hit with a number of rapid requests. You may want to try setting the UserAgent on the web request.
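Something along these lines (a minimal sketch; the browser string is just an example):

' HttpWebRequest sends no User-Agent header by default; many sites throttle
' or block requests that do not look like they come from a browser.
Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"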
Upvotes: 3