Reputation: 6660
I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.
The problem is that the site hosting the HTML prepends some non-HTML content to the markup. This causes wget not to recognize the document as HTML, so it never searches for images.
Is there a way to have wget strip the first X lines and/or force it to search for images?
Example URL:
First Lines of Content:
<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>
Last Lines of Content:
</BODY></HTML>
</TEXT>
</DOCUMENT>
EDIT: Solutions in PHP are certainly accepted.
Upvotes: 1
Views: 1565
Reputation: 16528
Wget is actually detecting the img tags. The issue is that the website in question has a robots.txt that disallows /Archives. Wget honors that file and does not retrieve the additional documents.
However, you can use the downloaded document as input to wget to retrieve related documents:
wget -l 1 --base=url --force-html -i file
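Here url is the base URL of the original page and file is the downloaded document. As a sketch, assuming the page was saved as ds4.htm and lived under a hypothetical http://www.example.com/Archives/ path (the actual example URL isn't shown above):

wget -p -l 1 --base=http://www.example.com/Archives/ --force-html -i ds4.htm

If robots.txt is the only blocker, wget also accepts -e robots=off to ignore it, at the cost of disregarding the site's stated crawling policy.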
Upvotes: 1
Reputation: 693
In PHP, you could use this function to strip off the first X lines:
function strip_toplines($string, $lines) {
    // Split on newlines, skip the first $lines lines, and rejoin the rest.
    $output = '';
    foreach (explode(PHP_EOL, $string) as $line_num => $line) {
        if ($line_num > ($lines - 1)) {
            $output .= $line . PHP_EOL;
        }
    }
    return trim($output);
}
and then call it like this (6 matches the six non-HTML lines shown in the example):
$html = strip_toplines(file_get_contents($url), 6);
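From there, a minimal sketch for fetching the images themselves, assuming the stripped markup parses with DOMDocument and that relative src values resolve against the document's directory (the URL handling here is illustrative, not tested against the real site):

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the stripped markup may not be strictly valid
$doc->loadHTML($html);

$base = dirname($url) . '/'; // directory of the source document
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Treat anything without an http(s) scheme as relative to $base.
    $imgUrl = preg_match('#^https?://#i', $src) ? $src : $base . $src;
    file_put_contents(basename($src), file_get_contents($imgUrl));
}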
Upvotes: 0