SF.

Reputation: 14039

How to politely ask remote webpage if it changed?

The remote webpage is updated at varying intervals: sometimes slowly, once every ten minutes or so; sometimes more often, every minute or even more frequently. There's a piece of data on that page I'd like to store, updating it whenever it changes (not necessarily capturing every change, but not falling too far behind the current value, and keeping the updates running 24/7).

Downloading the whole remote page every minute to check if it differs from previous version is definitely on the rude side.

Pinging the remote website for headers once a minute won't be too excessive.

Ideally, there would be some hint about when to recheck for updates, or a way to have the server reply with the content only once it has changed.

How should I go about minimizing unwanted traffic to the remote server while still staying up-to-date?

The "watcher/updater" is written in PHP and currently uses simplexml_load_file() to fetch the remote URL every minute. Something that plays nicely with that would probably be preferred: for example, an approach that, on determining that the file differs, simply proceeds with the content request rather than dropping the connection only to reconnect for the actual content half a second later.

edit: per request, sample headers.

    > HEAD xxxxxxxxxxxxxxxxxxxxxxxxxxx HTTP/1.1
    > User-Agent: curl/7.27.0
    > Host: xxxxxxxxxxxxxx
    > Accept: */*
    > 
    * additional stuff not fine transfer.c:1037: 0 0
    * HTTP 1.1 or later with persistent connection, pipelining supported
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Tue, 18 Feb 2014 19:35:04 GMT
    < Content-Type: application/rss+xml; charset=utf-8
    < Content-Length: 9865
    < Connection: keep-alive
    < Status: 200 OK
    < X-Frame-Options: SAMEORIGIN
    < X-XSS-Protection: 1; mode=block
    < X-Content-Type-Options: nosniff
    < X-UA-Compatible: chrome=1
    < ETag: "66509a4967de2c5984aa3475188012df"
    < Cache-Control: max-age=0, private, must-revalidate
    < X-Request-Id: 351a829a-641b-4e9e-a7ed-80ea32dcb071
    < X-Runtime: 0.068888
    < X-Powered-By: Phusion Passenger
    < X-Frame-Options: SAMEORIGIN
    < Accept-Ranges: bytes
    < X-Varnish: 688811779
    < Age: 0
    < Via: 1.1 varnish
    < X-Cache: MISS

Upvotes: 2

Views: 79

Answers (2)

deceze

Reputation: 522042

    ETag: "66509a4967de2c5984aa3475188012df"

This is a very promising header. If it indeed corresponds to changes in the page itself, you can query the server setting this request header:

    If-None-Match: "<the last received etag value>"

If the content has not been modified, the server should respond with a 304 Not Modified status and no body. See http://en.wikipedia.org/wiki/HTTP_ETag. The server also appears to sit behind a caching front end (note the X-Varnish and Via headers in your sample), so you're probably not hitting it too hard anyway.
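A minimal sketch of that conditional GET in PHP, assuming the cURL extension is available and that the ETag really does track changes to the page. The function names `extractEtag` and `fetchIfChanged`, and the `$url` and `$etagFile` parameters, are made up for illustration. If the page changed, the body arrives in the same request, so there is no second connection.

```php
<?php
// Sketch only: function names, $url and $etagFile are illustrative assumptions.

/** Return the ETag value from a raw header block, or null if there is none. */
function extractEtag(string $rawHeaders): ?string
{
    return preg_match('/^ETag:\s*(.+?)\s*$/mi', $rawHeaders, $m) ? $m[1] : null;
}

/**
 * Conditional GET. Returns the body if the page changed (HTTP 200),
 * or null if the server answered 304 Not Modified.
 */
function fetchIfChanged(string $url, string $etagFile): ?string
{
    // ETag remembered from the previous poll, if any.
    $lastEtag = is_file($etagFile) ? trim((string) file_get_contents($etagFile)) : '';
    $rawHeaders = '';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Collect response header lines as they arrive.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION,
        function ($ch, string $line) use (&$rawHeaders): int {
            $rawHeaders .= $line;
            return strlen($line);
        });
    if ($lastEtag !== '') {
        curl_setopt($ch, CURLOPT_HTTPHEADER, ['If-None-Match: ' . $lastEtag]);
    }

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($status === 304) {
        return null; // unchanged; no body was transferred
    }
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status $status");
    }

    $etag = extractEtag($rawHeaders);
    if ($etag !== null) {
        file_put_contents($etagFile, $etag); // remember it for the next poll
    }
    return $body;
}
```

When `fetchIfChanged()` returns a string, it can be passed to `simplexml_load_string()` in place of the current `simplexml_load_file()` call; a `null` return means there is nothing new to parse.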

Upvotes: 2

Amal Murali

Reputation: 76646

Send an HTTP HEAD request using cURL and read the Last-Modified value, if the server sends one. HEAD works like GET, except that the server returns only the status line and headers, without the body, so you won't be "rude" to the other server.

On the command line, we can do this with the following command:

    curl -s -I http://example.com/file.html | grep -i '^Last-Modified:'

It shouldn't be too hard to rewrite this using PHP's cURL library.
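A rough PHP equivalent might look like the sketch below, assuming the cURL extension is available. The function names `extractLastModified` and `headLastModified` are hypothetical. Note that the sample headers in the question contain no Last-Modified line, so for that particular server the result would be `null`.

```php
<?php
// Sketch only: function names are illustrative assumptions.

/** Return the Last-Modified header as a Unix timestamp, or null if absent or unparseable. */
function extractLastModified(string $rawHeaders): ?int
{
    if (!preg_match('/^Last-Modified:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
        return null;
    }
    $timestamp = strtotime($m[1]);
    return $timestamp === false ? null : $timestamp;
}

/** Send a HEAD request and return the Last-Modified timestamp, if any. */
function headLastModified(string $url): ?int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD instead of GET
    curl_setopt($ch, CURLOPT_HEADER, true);         // include headers in the returned string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the result instead of printing it

    $rawHeaders = curl_exec($ch);
    if ($rawHeaders === false) {
        throw new RuntimeException('HEAD request failed: ' . curl_error($ch));
    }
    return extractLastModified($rawHeaders);
}
```

The watcher would store the returned timestamp and fetch the full page only when a later poll returns a larger value.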

Upvotes: 2
