Reputation: 1951
I'm using the following code to fetch the HTML of a New York Times page, and unfortunately it returns null. I have tried other websites (CNN, The Guardian, etc.) and they work fine. I'm using the URLFetchService from Google App Engine.
Here's the code snippet. What am I doing wrong?
    // url = https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html
    private String extractFromUrl(String url, boolean forced) throws java.io.IOException, org.xml.sax.SAXException,
            de.l3s.boilerpipe.BoilerpipeProcessingException {
        Future<HTTPResponse> urlFuture = getMultiResponse(url);
        HTTPResponse urlResponse = null;
        try {
            urlResponse = urlFuture.get(); // Returns null here
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        } catch (ExecutionException ee) {
            ee.printStackTrace();
        }
        String urlResponseString = new String(urlResponse.getContent());
        return urlResponseString;
    }

    public Future<HTTPResponse> getMultiResponse(String website) {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        URL url = null;
        try {
            url = new URL(website);
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        FetchOptions fetchOptions = FetchOptions.Builder.followRedirects();
        HTTPRequest request = new HTTPRequest(url, HTTPMethod.GET, fetchOptions);
        Future<HTTPResponse> futureResponse = fetcher.fetchAsync(request);
        return futureResponse;
    }
The exception I'm getting is this:
java.util.concurrent.ExecutionException: java.io.IOException: Could not fetch URL: https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html, error: Received exception executing http method GET against URL https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html: null
    at com.google.appengine.api.utils.FutureWrapper.setExceptionResult(FutureWrapper.java:66)
    at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:97)
    at main.java.com.myapp.app.MyServlet.extractFromUrl(MyServlet.java:10)
Upvotes: 0
Views: 108
Reputation: 482
Looking at the verbose output of curl, you can see that the website tries to set a cookie and redirects you if the cookie is not accepted.
It appears that the Times will redirect you 7 times before giving up:
$ curl --verbose -L "https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html" 2>&1 | grep 303 | wc -l
7
It appears that the maximum number of redirects for UrlFetch is 5 [0].
In order to successfully crawl www.nytimes.com, you will have to disable following redirects and handle the cookie logic yourself; a rough sketch follows the links below. There is some inspiration here [1] and here [2].
[0] https://groups.google.com/forum/#!topic/google-appengine/F2dX3LqOrhY
[1] https://groups.google.com/d/msg/google-appengine-java/pE0xak7LRxg/M__U-SM3YMMJ
[2] https://stackoverflow.com/a/13588616/7947020
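Here is a minimal sketch of that idea, assuming the standard com.google.appengine.api.urlfetch classes. fetchWithCookies and MAX_HOPS are hypothetical names, and the cookie handling is deliberately naive (it keeps only the name=value part of each Set-Cookie header), so treat it as a starting point rather than a drop-in replacement:

    import com.google.appengine.api.urlfetch.FetchOptions;
    import com.google.appengine.api.urlfetch.HTTPHeader;
    import com.google.appengine.api.urlfetch.HTTPMethod;
    import com.google.appengine.api.urlfetch.HTTPRequest;
    import com.google.appengine.api.urlfetch.HTTPResponse;
    import com.google.appengine.api.urlfetch.URLFetchService;
    import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: follow redirects by hand, carrying cookies between hops.
    private String fetchWithCookies(String website) throws IOException {
        final int MAX_HOPS = 10; // assumption: comfortably above the 5 redirects URLFetch allows
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        List<String> cookies = new ArrayList<String>();
        URL url = new URL(website);

        for (int hop = 0; hop < MAX_HOPS; hop++) {
            // Disable automatic redirects so we see each 3xx response ourselves.
            HTTPRequest request = new HTTPRequest(
                    url, HTTPMethod.GET, FetchOptions.Builder.doNotFollowRedirects());
            if (!cookies.isEmpty()) {
                // Send everything collected so far as a single Cookie header.
                StringBuilder cookieHeader = new StringBuilder();
                for (String cookie : cookies) {
                    if (cookieHeader.length() > 0) {
                        cookieHeader.append("; ");
                    }
                    cookieHeader.append(cookie);
                }
                request.setHeader(new HTTPHeader("Cookie", cookieHeader.toString()));
            }

            HTTPResponse response = fetcher.fetch(request);
            String location = null;
            for (HTTPHeader header : response.getHeaders()) {
                if ("Set-Cookie".equalsIgnoreCase(header.getName())) {
                    // Keep only "name=value", drop Path/Expires/etc. attributes.
                    cookies.add(header.getValue().split(";", 2)[0]);
                } else if ("Location".equalsIgnoreCase(header.getName())) {
                    location = header.getValue();
                }
            }

            int code = response.getResponseCode();
            if (code >= 300 && code < 400 && location != null) {
                url = new URL(url, location); // resolve relative Location values
                continue;
            }
            return new String(response.getContent(), "UTF-8");
        }
        throw new IOException("Gave up after " + MAX_HOPS + " redirects for " + website);
    }

The synchronous fetch() is used here only to keep the loop simple; the same idea works with fetchAsync() as in your getMultiResponse(), as long as you resolve each Future before issuing the next hop.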
Upvotes: 1