Bernd Wilke πφ

Reputation: 10791

wrong srcset attributes from httrack

I have spidered a website with HTTrack, which generated a lot of files at different directory levels. But the website uses picture / source tags with srcset attributes, which HTTrack does not handle, so all those pictures do not work offline.

HTTrack can see the links if I use the option "Attempt to detect all links (even in unknown tags/javascript code)" (in WinHTTrack), and it then copies all images to local storage. But it does not change the paths to relative ones.

Now I need a script (PowerShell / GNU bash) that edits all the HTML files and adapts the paths in the srcset attributes to the correct relative paths.

My idea is to recurse into each folder, passing an additional ../ as a parameter to insert or replace with sed.

what to do:

example files:

index.html
cat1/product1.html
cat2/option3/product5.html

each contains some picture tags like:

<picture>
     <source srcset="/images/img1_low.jpg, /images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="/images/img1_med.jpg, /images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="/images/img1_hi.jpg, /images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../images/img1_hi.jpg" />
</picture>

Inside the img tag the path is always set correctly by HTTrack (images/img1_hi.jpg, ../images/img1_hi.jpg, ../../images/img1_hi.jpg),

but the source tags must also contain the matching paths:

in index.html:

<picture>
     <source srcset="images/img1_low.jpg, images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="images/img1_med.jpg, images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="images/img1_hi.jpg, images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="images/img1_hi.jpg" />
</picture>

in cat1/product1.html:

<picture>
     <source srcset="../images/img1_low.jpg, ../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="../images/img1_med.jpg, ../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="../images/img1_hi.jpg, ../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../images/img1_hi.jpg" />
</picture>

in cat2/option3/product5.html:

<picture>
     <source srcset="../../images/img1_low.jpg, ../../images/img1_low_ret.jpg x2" media="(max-width: 470px)">
     <source srcset="../../images/img1_med.jpg, ../../images/img1_med_ret.jpg x2" media="(max-width: 960px)">
     <source srcset="../../images/img1_hi.jpg, ../../images/img1_hi_ret.jpg x2" media="(min-width: 961px)">
     <img src="../../images/img1_hi.jpg" />
</picture>
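
The rule behind these three examples is one ../ per directory level between the file and the mirror root. A minimal sketch of that mapping, assuming paths are given relative to the mirror root; relative_prefix is an illustrative name and not part of HTTrack:

```shell
#!/usr/bin/env bash
# relative_prefix is a hypothetical helper: it prints one "../" for every
# directory component of a path given relative to the mirror root.
relative_prefix() {
    local path=$1 prefix=''
    # Strip a leading "./" as produced by find.
    path=${path#./}
    # Consume one directory component per iteration.
    while [[ $path == */* ]]; do
        prefix+='../'
        path=${path#*/}
    done
    printf '%s' "$prefix"
}

# The three example files from above.
for f in index.html cat1/product1.html cat2/option3/product5.html; do
    printf '%s -> "%s"\n' "$f" "$(relative_prefix "$f")"
done
```

The printed prefix is what would have to be put in front of images/ in each srcset entry of the corresponding file.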

my attempt:

#!/usr/bin/bash

function workfolder {
    # $1 = current folder
    # $2 = prefix upfolders

    pushd $PWD
    cd $1

    for i in $( ls ) ; do
        if [ -d $i ] ; then
            workfolder $i ../$2
        fi
    done

    for i in $( ls *.html ) ; do
        sed -i 's/srcset="images/srcset="$2images/g' $i
        sed -i 's/, images/, $2images/g' $i
    done

    popd

}

workfolder .

Apart from producing too many errors, the $2 in the sed replacement is not expanded but inserted literally.

Upvotes: 2

Views: 1231

Answers (3)

Damien Wilson

Reputation: 354

Summary

This method works within HTTrack and modifies files on-the-fly. It doesn't require a post-action script as described in the context of this question.

Solution

The issue with srcset in HTTrack is that, in browsers that support srcset, the images are not rendered. As the OP states, this occurs because the browser tries to load a URL that HTTrack did not convert.

The solution (credit to @jonathandavidarndt) is to remove the srcset attribute completely, which makes modern browsers fall back to the src attribute and display the image as intended.

To achieve this I leveraged the -V option in HTTrack:

-V : execute system command after each files ($0 is the filename: -V "rm $0") (--userdef-cmd )

The option value uses the UNIX command sed to transform text within the newly downloaded file:

sed -i 's/srcset="[^"]*"//g' $0

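To launch the HTTrack CLI with this option you can write a command like the following (the inner double quotes and the $ are escaped so that the shell passes them through to HTTrack unchanged):

httrack https://website.to.copy.co.uk/ -V "sed -i 's/srcset=\"[^\"]*\"//g' \$0"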

In summary:
Each time a file is downloaded by HTTrack, run the sed command on the file. If srcset is present, remove it.
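
To see what the sed expression does before wiring it into HTTrack, it can be tried on a single line. A minimal sketch, assuming a standard sed; the function name strip_srcset and the sample markup are made up for illustration:

```shell
#!/usr/bin/env bash
# strip_srcset is a hypothetical helper name; it applies the same
# expression as above to stdin instead of editing a file in place.
strip_srcset() {
    sed 's/srcset="[^"]*"//g'
}

# Sample line modelled on the markup in the question.
line='<source srcset="/images/a.jpg, /images/a_ret.jpg x2" media="(max-width: 470px)">'
result=$(printf '%s\n' "$line" | strip_srcset)
printf '%s\n' "$result"
```

The source element keeps its media attribute but loses its srcset, so the browser should skip it and use the img element inside the picture instead.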

Upvotes: 2

rojen

Reputation: 559

If you are using WordPress, install the plugin Code Snippets and add a new snippet with the following code:

add_filter( 'wp_calculate_image_srcset_meta', '__return_null' );

Press the "Save Changes and Activate" button. This snippet removes the srcset attribute from the images on your site. Code Source

Upvotes: -1

Bernd Wilke πφ

Reputation: 10791

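#!/usr/bin/bash
function workfolder {
    # $1 = current folder
    # $2 = prefix of upfolders, slashes escaped for sed (e.g. ..\/..\/)

    pushd "$1" > /dev/null || return
    echo "=====^ $PWD ====="
    # recurse into subfolders first, adding one more ../ to the prefix
    for i in */ ; do
        if [ -d "$i" ] ; then
            workfolder "$i" "..\\/$2"
        fi
    done
    for i in *.html ; do
        [ -f "$i" ] || continue    # no html files in this folder
        echo " working on: $PWD/$i with $2"
        # leave the single quotes around $2 so the shell expands it
        sed -i 's/srcset="image/srcset="'"$2"'image/g' "$i"
        sed -i 's/, image/, '"$2"'image/g' "$i"
    done
    popd > /dev/null
    echo "=====v $PWD ====="
}

# start in the mirror root with an empty prefix
workfolder . ""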

The traps are: getting $2 expanded in the sed command at all (in the first attempt it was not expanded), and escaping ../ correctly so that the second parameter is usable inside the sed commands.
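
Both traps can be reproduced on a single line of markup. A minimal sketch, in which the variable prefix stands in for $2 and the sample line is made up; it also uses | as the sed delimiter, which is a different choice from the script above and avoids having to escape the slashes at all:

```shell
#!/usr/bin/env bash
# prefix stands in for the $2 parameter of workfolder (illustrative value).
prefix='../../'
line='<source srcset="images/a.jpg, images/a_ret.jpg x2">'

# Trap 1: inside single quotes the shell does not expand $prefix,
# so sed inserts the text "$prefix" literally.
literal=$(printf '%s\n' "$line" | sed 's|srcset="images|srcset="$prefiximages|g')

# Double quotes let the shell expand ${prefix}; with | as the delimiter
# the slashes in ../../ need no escaping (trap 2).
expanded=$(printf '%s\n' "$line" |
    sed -e "s|srcset=\"images|srcset=\"${prefix}images|g" \
        -e "s|, images|, ${prefix}images|g")

printf 'literal:  %s\n' "$literal"
printf 'expanded: %s\n' "$expanded"
```

With this delimiter the prefix could be passed down as plain ../ instead of ..\/, at the cost of having to make sure the prefix never contains a | character.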

Upvotes: 1
