Reputation: 1205
Thing is: I have a webcrawler framework, and independent modules that implement this framework. All of these modules capture news from site-specific news websites.
In the framework there are 2 unpredictable errors: IOException and SocketTimeoutException, for obvious reasons (the website may be offline and/or under maintenance).
Thing is: on one specific website (THIS one) I get random IOExceptions all the time. I tried to predict it, but I still don't know why I'm getting this error.
I figured it was from bombarding it with requests during the test phase. It is not, since after 2 or 3 days without sending another request it still throws the error.
In a nutshell: the site does not require authentication, and it randomly throws 403. RANDOMLY.
Since a 403 can mean multiple different things, I'd like to see what the specific problem with my application is.
If I could get which 403 it is, I could try to work around it. (403.1, 403.2, ..., 403.n)
//If you guys want the code, it's a basic Jsoup get.
//(I have also tried it with native API,
//and still get the same random 403 errors)
//Note that I also tried it with no redirection, and still get the error
Document doc = Jsoup
.connect("http://www.agoramt.com.br/")
.timeout(60000)
.followRedirects(true)
.get();
//You may criticize the code, but this specific line is the one
//that throws the error, and it doesn't randomly do that to the other 3k
//site connections. That's why I want to get the specifics of the 403.
Upvotes: 1
Views: 390
Reputation: 60958
I have little idea what Jsoup is, but I suggest you read up on HttpURLConnection.getErrorStream(). This method will allow you to read the error document. Access to the header fields of the error document should be possible after a failed connection as well, the way you usually access header fields. Together, these two (body and header) will provide you with all the information which the server supplies.
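For illustration, a minimal sketch of what that could look like with plain HttpURLConnection (the URL is taken from the question; everything else is just an example, not something specific to that site):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ErrorStreamDemo {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://www.agoramt.com.br/").openConnection();
        int status = conn.getResponseCode(); // e.g. 403
        System.out.println("Status: " + status);
        // header fields of the (error) response, read the usual way
        conn.getHeaderFields().forEach((name, values) -> System.out.println(name + ": " + values));
        if (status >= 400 && conn.getErrorStream() != null) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getErrorStream(), StandardCharsets.UTF_8))) {
                in.lines().forEach(System.out::println); // body of the error document
            }
        }
    }
}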
Upvotes: 0
Reputation: 9905
To piggy-back on what a couple others have said, is it possible your crawler is being recognized and treated as a network scanner or penetration tool?
Upvotes: 0
Reputation: 2841
In the design of a webcrawler, unexpected outages and error codes should be accounted for.
Keep a queue of sites that had a failure last time so that after a period of time, the webcrawler can retry the request.
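A rough sketch of what such a retry queue might look like (the names and the one-hour interval are just illustrative, not part of your framework):
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class RetryQueue {
    private final ConcurrentLinkedQueue<String> failed = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // remember a URL whose request failed (403, timeout, ...)
    void markFailed(String url) {
        failed.add(url);
    }

    // periodically retry everything that failed last round;
    // fetch is whatever the crawler normally does for a URL
    void start(Consumer<String> fetch) {
        scheduler.scheduleAtFixedRate(() -> {
            String url;
            while ((url = failed.poll()) != null) {
                fetch.accept(url);
            }
        }, 1, 1, TimeUnit.HOURS);
    }
}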
Upvotes: 1
Reputation: 81
Maybe try adding index.php to the end (or whatever the main homepage file for the site is: index.html, etc.).
I am unsure if this will help solve your problem, however. I use a Connection class that I found somewhere, which basically does what one of the posts above said (it emulates the headers of a web browser, so the request appears to come from, say, Firefox instead of whatever the Java default is).
I guess it is worth a shot.
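If it helps, the header-emulation part with Jsoup might look something like this (the user-agent string and the index.php path are only examples, not something I know works for that site):
Document doc = Jsoup
    .connect("http://www.agoramt.com.br/index.php") // or whatever the homepage file is
    .userAgent("Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0")
    .referrer("https://www.google.com/")
    .timeout(60000)
    .get();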
Upvotes: 0
Reputation: 503
It could be a faulty internet connection at the site, or the site could have code that tries to stop spiders. There could also be a weird proxy server in the way.
Upvotes: 0
Reputation: 2425
A server may return a 403 on a whim. You are not expected to resolve this on your end, except to respect the server's wish not to let you in. You may try to read the response body for details provided by the server, but that's probably all you'll get. The 403.n error codes you are looking for are, I believe, an IIS-specific feature, and the site you pointed out seems to be served by nginx, so don't expect to get those.
If your webcrawler randomly gets a 403 but a regular web browser (from the same IP) never gets a 403 then the best I could suggest is for you to make your webcrawler request headers look exactly like what a regular web browser would send. Whether that is proper behavior for a polite webcrawler is a different discussion.
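For what it's worth, a rough Jsoup sketch for pulling whatever the server attaches to the 403 (instead of having the call throw) could look like this; Connection.Response comes from org.jsoup.Connection:
Connection.Response res = Jsoup.connect("http://www.agoramt.com.br/")
    .ignoreHttpErrors(true) // keep the 403 response instead of throwing HttpStatusException
    .timeout(60000)
    .execute();
System.out.println(res.statusCode() + " " + res.statusMessage());
System.out.println(res.headers()); // e.g. Server: nginx
System.out.println(res.body());    // whatever body the server sends with the 403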
Upvotes: 3
Reputation:
What the problem may be is that there is a folder you can get to; your program wants to read all the files on the site, but the webserver gives a 403 error and will probably kill the socket. This is what I'm thinking; without code, I can't tell whether it's a programming error or just the configuration of the webserver.
Upvotes: 0