St. John Johnson

Reputation: 6660

Download HTML and Images with WGet without first few lines

I'm attempting to use wget with the -p option to download specific documents and the images linked in the HTML.

The problem is that the site hosting the HTML puts some non-HTML information before the HTML. This causes wget not to interpret the document as HTML, so it doesn't search for images.

Is there a way to have wget strip the first X lines and/or force searching for images?

Example URL:

First Lines of Content:

<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>

Last Lines of Content:

</BODY></HTML>
</TEXT>
</DOCUMENT>

EDIT: Solutions in PHP are certainly accepted.

Upvotes: 1

Views: 1565

Answers (2)

Devon_C_Miller

Reputation: 16528

Wget is actually detecting the img tags. The issue is that the website in question has a robots.txt that disallows /Archives. Wget honors that rule and does not retrieve the additional documents.

However, you can use the downloaded document as input to wget to retrieve related documents:

wget -l 1 --base=url --force-html -i file
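If the wrapper lines before <HTML> still get in the way, the saved copy could be trimmed before it is handed to -i. This is only a minimal sketch: strip_top_lines is a hypothetical helper name, and the count of 6 assumes the header looks exactly like the sample in the question.

```shell
#!/bin/sh
# strip_top_lines N: copy stdin to stdout, skipping the first N lines.
# (Hypothetical helper; "tail -n +K" starts output at line K.)
strip_top_lines() {
    tail -n "+$(( $1 + 1 ))"
}

# Demo on an inline sample shaped like the header in the question.
strip_top_lines 6 <<'EOF'
<DOCUMENT>
<TYPE>S-4
<SEQUENCE>1
<FILENAME>ds4.htm
<DESCRIPTION>FORM S-4
<TEXT>
<HTML><HEAD>
<TITLE>Form S-4</TITLE>
EOF
```

On a saved copy this would be something like strip_top_lines 6 < file > trimmed.html (file names are placeholders), with trimmed.html then passed to wget -i. It does not remove the trailing </TEXT> and </DOCUMENT> lines.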

Upvotes: 1

Jamescun

Reputation: 693

In PHP, you could use this function to strip out X lines:

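function strip_toplines($string, $lines) {
    // Split on any line ending (\r\n, \r or \n); a remote file may not use PHP_EOL
    $all = preg_split('/\r\n|\r|\n/', $string);
    // Drop the first $lines lines and rejoin the remainder
    $output = implode(PHP_EOL, array_slice($all, $lines));
    return trim($output);
}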

and then call it like this:

$html = strip_toplines(file_get_contents($url), 6);

Upvotes: 0
