Colin Ramsay

Reputation: 16476

Check if web page is modified / has expired with Ruby

I'm writing a crawler in Ruby, and I want to honour the headers that the server sends out in order to make the crawl more efficient. Is there a straightforward way in Ruby of determining whether a page needs to be re-downloaded by the client? I know I need to consider at least these headers: Expires, ETag, Last-Modified and Cache-Control.

What's the definitive way of determining this - is it specified anywhere?

Upvotes: 0

Views: 196

Answers (2)

danivovich

Reputation: 4217

You have the right headers, but bear in mind that it is the server that sets them. If they are set correctly, you can use them to make the decision, but none of them are required, so a response may omit any or all of them.

Personally, I would probably start by recording the Expires value and the ETag during the initial download. On the next pass, if Expires or the ETag showed some sign that I might need to re-download (or if they weren't even set), I'd then look at Last-Modified. I wouldn't expect Cache-Control to be all that useful.
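As a rough sketch of that approach (not a definitive implementation), the snippet below keeps the Expires, ETag and Last-Modified values from the first download and uses them on the next pass. `CachedPage`, `fresh?` and `conditional_headers` are hypothetical names made up for illustration, and the sketch assumes the header values were stored as the raw strings the server sent.

```ruby
require "time"

# Hypothetical record of what the crawler saved on the previous fetch.
CachedPage = Struct.new(:expires, :etag, :last_modified, keyword_init: true)

# True when the stored Expires value is still in the future,
# in which case the page can be skipped without any request.
def fresh?(page, now = Time.now)
  return false if page.expires.nil?
  Time.httpdate(page.expires) > now
rescue ArgumentError
  false # an unparseable Expires value (e.g. "0") is treated as expired
end

# Request headers that let the server reply 304 Not Modified
# instead of sending the whole page again.
def conditional_headers(page)
  headers = {}
  headers["If-None-Match"] = page.etag if page.etag
  headers["If-Modified-Since"] = page.last_modified if page.last_modified
  headers
end
```

If `fresh?` returns false, the hash from `conditional_headers` can be passed as the second argument to `Net::HTTP#get`; a `Net::HTTPNotModified` response then means the stored copy is still good.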

Upvotes: 1

glenn jackman

Reputation: 247042

You'll want to read about the head method in Net::HTTP, which requests a page's headers without downloading its body -- http://www.ruby-doc.org/stdlib/
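For illustration, a minimal sketch of using a HEAD request to decide whether a re-download looks necessary. `head_validators` and `changed?` are hypothetical helper names, and the comparison rules (prefer the ETag, fall back to Last-Modified, otherwise assume changed) are an assumption rather than anything the headers mandate.

```ruby
require "net/http"
require "uri"

# Issue a HEAD request and return only the validators we care about.
def head_validators(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  { etag: response["ETag"], last_modified: response["Last-Modified"] }
end

# Compare validators stored from the last crawl with current ones.
# Returns true when a re-download looks necessary.
def changed?(stored, current)
  if stored[:etag] && current[:etag]
    return stored[:etag] != current[:etag]
  end
  if stored[:last_modified] && current[:last_modified]
    return stored[:last_modified] != current[:last_modified]
  end
  true # no usable validators on both sides: assume the page may have changed
end
```

Usage would look something like `changed?(stored, head_validators("http://example.com/"))`, where `stored` is whatever hash was saved on the previous pass and the URL is a placeholder.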

Upvotes: 0
