Reputation: 41
I'm trying to mirror the whole website at http://opposedforces.com/parts/impreza/en_g11/type_63/
Accessing through a browser (Firefox, w3m) or Postman work fine, and return the html file.
Accessing through wget
, cURL
, the Python requests
module and HTTrack all fail.
wget
specifically fails with:
↪ wget --mirror -p --convert-links "http://opposedforces.com/parts/impreza/en_g11/type_63/"
--2021-02-03 20:48:29-- http://opposedforces.com/parts/impreza/en_g11/type_63/
Resolving opposedforces.com (opposedforces.com)... 138.201.30.59Connecting to opposedforces.com (opposedforces.com)|138.201.30.59|:80... connected.
HTTP request sent, awaiting response... 0
2021-02-03 20:48:29 ERROR 0: (no description).
Converted links in 0 files in 0 seconds.
It seemingly returns no information. Originally I thought some JavaScript was generating the html, but I can't find any JS using Firefox developer tools, and I would assume Postman would not work in this case.
Any ideas how to get around this? Ideally I can use wget
to download this and all sub-pages, but alternative solutions are also welcome.
Upvotes: 1
Views: 722
Reputation: 5190
This is one of those times when the website is completely and absolutely broken. It is unfortunate that web browsers go to great lengths to support such broken web pages.
The problem is that the server sends a broken response. This is the response I see:
---response begin---
HTTP/1.1 000
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 44892
Expires: -1
Server: Microsoft-IIS/7.5
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=gxhoir45jpd43545iujdpiru; path=/; HttpOnly
X-Powered-By: ASP.NET
Date: Fri, 05 Feb 2021 09:26:26 GMT
See? It returns a HTTP/1.1 000 response, which doesn't exist in the spec. Web browsers seem to just accept it as a 200 response and move on. Wget doesn't.
But you can get around it by using the --content-on-error
option which is ask Wget to download the content irrespective of the response code
Upvotes: 1