I am trying to scrape a website using wget. Here is my command:
wget -t 3 -N -k -r -x
The -N (time-stamping) switch tells wget to skip a file unless the server's copy is newer than the local one. But this isn't working: the same files get downloaded over and over again when I restart the scraping operation, even though the files haven't changed.
Many of the downloaded pages report:
Last-modified header missing -- time-stamps turned off.
I've tried scraping several web sites but all tried so far give this problem.
Is this a situation controlled by the remote server? Are they choosing not to send those timestamp headers? If so, is there anything I can do about it?
I am aware of the -nc (no-clobber) option, but that prevents an existing file from ever being overwritten, even when the server's copy is newer, so stale local data would accumulate.
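For context, the comparison -N performs can be sketched like this (a minimal illustration, not wget's actual source; the function name and the sample header value are made up):

```python
# Sketch of the decision wget's -N (time-stamping) makes per file.
from email.utils import parsedate_to_datetime

def should_download(last_modified_header, local_mtime_epoch):
    """Return True if the remote copy should be fetched."""
    if last_modified_header is None:
        # No Last-Modified header: wget reports
        # "Last-modified header missing -- time-stamps turned off."
        # and downloads the file unconditionally, which matches
        # the behaviour described above.
        return True
    remote = parsedate_to_datetime(last_modified_header).timestamp()
    return remote > local_mtime_epoch

# Remote copy last changed Nov 2023; local copy saved later: skip.
print(should_download("Wed, 01 Nov 2023 00:00:00 GMT", 1700000000.0))
# Header absent: always download again.
print(should_download(None, 1700000000.0))
```

This is why the missing header defeats -N: without a Last-Modified value there is nothing to compare against, so every restart re-fetches everything.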
Thanks Drew
Upvotes: 4
Views: 4388
Reputation: 3378
The wget -N
switch does work, but many web servers don't send the Last-Modified header for various reasons. For example, dynamic pages (PHP, any CMS, etc.) have to actively implement the functionality: work out when the content was last modified, and send the header. Some do, while many don't.
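To make that concrete, here is a minimal sketch (in Python's stdlib http.server, not any particular CMS) of what a dynamic page has to do for -N to work: send Last-Modified, and answer a conditional request with 304. The fixed timestamp and response body are arbitrary demo values:

```python
# Demo: a handler that implements Last-Modified / If-Modified-Since,
# exercised against itself on an ephemeral local port.
import threading
from email.utils import formatdate, parsedate_to_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen
from urllib.error import HTTPError

CONTENT_MTIME = 1698796800.0  # pretend the content last changed then

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ims = self.headers.get("If-Modified-Since")
        if ims and parsedate_to_datetime(ims).timestamp() >= CONTENT_MTIME:
            self.send_response(304)  # client's copy is current: no body
            self.end_headers()
            return
        body = b"hello"
        self.send_response(200)
        self.send_header("Last-Modified", formatdate(CONTENT_MTIME, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

first = urlopen(url)  # unconditional GET -> 200 plus the header
stamp = first.headers["Last-Modified"]
try:
    status = urlopen(Request(url, headers={"If-Modified-Since": stamp})).status
except HTTPError as e:  # urllib raises on 304, so catch it
    status = e.code
print(first.status, status)
server.shutdown()
```

A server that skips the Last-Modified step here is exactly the case in the question: wget has nothing to compare, turns time-stamps off, and re-downloads the page.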
There really isn't another reliable way to check whether a file has changed, either.
Upvotes: 2