alreadyexists

Reputation: 355

Nutch inconsistently ignores redirects

I ran into trouble crawling some pretty simple redirect cases (Nutch 1.9 / OpenJDK 7). Here is a packet capture of the process.

Time        Source          Destination     Protocol  Info
12.988003   99.99.99.99     8.8.4.4         DNS       Standard query 0xc165  A bloomberg.com
13.032343   8.8.4.4         99.99.99.99     DNS       Standard query response 0xc165  A 69.191.212.191 A 69.191.251.238
13.124471   99.99.99.99     69.191.212.191  HTTP      GET /robots.txt HTTP/1.0
13.228846   69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently  (text/html)
13.264230   99.99.99.99     8.8.4.4         DNS       Standard query 0x7089  A www.bloomberg.com
13.344767   8.8.4.4         99.99.99.99     DNS       Standard query response 0x7089  CNAME www.bloomberg.com.edgekey.net CNAME e4569.x.akamaiedge.net A 23.214.189.136
13.351030   99.99.99.99     23.214.189.136  HTTP      GET /robots.txt HTTP/1.0
13.359121   23.214.189.136  99.99.99.99     HTTP      HTTP/1.0 200 OK  (text/plain)
13.448604   99.99.99.99     69.191.212.191  HTTP      GET / HTTP/1.0
13.537211   69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently  (text/html)
13.640146   99.99.99.99     69.191.212.191  HTTP      GET / HTTP/1.0
13.738564   69.191.212.191  99.99.99.99     HTTP      HTTP/1.1 301 Moved Permanently  (text/html)

Nutch tries to fetch http://bloomberg.com, which replies with a 301 redirect to http://www.bloomberg.com. The redirect is handled correctly for robots.txt. For 'GET /', however, the fetcher keeps requesting the original hostname, which keeps replying with 301. Fetching fails no matter how large http.redirect.max is (I've tried 10).
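To make the behaviour I expected concrete, here is a minimal Python sketch of redirect following. It is not Nutch's code: follow_redirects and the responses table are hypothetical stand-ins for the fetcher and the network, and the 200 for the www host on 'GET /' is an assumption (the capture only shows a 200 for robots.txt).

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow_redirects(start_url, responses, max_redirects):
    """Follow redirects, requesting the Location target on each hop.

    `responses` is a hypothetical stand-in for the network: it maps a URL
    to a (status, location) pair. Returns (final_url, status, hops).
    """
    url = start_url
    for hops in range(max_redirects + 1):
        status, location = responses[url]
        if status not in REDIRECT_CODES or location is None:
            return url, status, hops
        # Expected step: the next request goes to the redirect target,
        # not back to the original URL.
        url = urljoin(url, location)
    raise RuntimeError("exceeded %d redirects" % max_redirects)


# Simulated responses for the case in the capture (the 200 is assumed).
responses = {
    "http://bloomberg.com": (301, "http://www.bloomberg.com"),
    "http://www.bloomberg.com": (200, None),
}

result = follow_redirects("http://bloomberg.com", responses, 10)
print(result)
```

With this logic a single hop is enough, so a limit of 10 should be more than sufficient. What the capture shows instead is the second 'GET /' going to the same address as the first.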

Nutch 1.9 running on OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.12.04.1) OpenJDK Client VM (build 24.65-b04, mixed mode, sharing)

Is this a bug (and if so, could you confirm it?) or just a misconfiguration?

Thanks.

Upvotes: 0

Views: 308

Answers (1)

alreadyexists

Reputation: 355

This was a bug; 1.10 should ship with the fix: https://github.com/apache/nutch/commit/ed052df8822380ccfa89a9ffa1df324933669a59

Upvotes: 1
