Reputation: 476
I need to recursively mirror a site's wallpaper images, but only those links that have a specific markup around them, like:
<div class="wb_more">
Original Resolution: <a href="//site.com/download/space_planet_sky_94434/4800x2700">4800x2700</a><br>
Views: <a href="/download/last">96661</a>
</div>
but not others, like:
<div class="wd_resolution">
<span class="wd_res_cat">Fullscreen</span>
<span class="wd_res_cat_raz"><a class="wb_res_select" href="//site.com/download/space_planet_sky_94434/1600x1200">1600x1200</a>
...
</span>
...
</span>
</div>
Note, the URLs are identical except for the resolution, and the resolution of the original varies from image to image, so only the surrounding markup makes the difference, e.g. the link being preceded by a text like Original Resolution:.
Is there a solution for this using wget or httrack or some other tool?
Thank you.
Upvotes: 0
Views: 1110
Reputation: 855
You can do this with a scraping tool like scrapy. You can parse the HTML response with CSS selectors, XPath, regex, or other methods to extract only the links matching your rule.
I think it is preferable to write one scraper per site. For example, for the first markup:
import scrapy


class ImageLink(scrapy.Spider):
    name = 'imageLink'
    # Here the list of URLs to start scraping from
    start_urls = ['https://images.com']

    def parse(self, response):
        # Get the "Original Resolution" link only
        link = response.css('div.wb_more > a::attr(href)').get()
        # Follow the link; the callback saves the image
        yield scrapy.Request(url=response.urljoin(link), callback=self.save_image)

    def save_image(self, response):
        link = response.url
        # Guess the filename from the link,
        # e.g. space_planet_sky_94434
        filename = link.split('/')[4]
        # Save the image
        with open(filename, 'wb') as f:
            f.write(response.body)
If the site has pagination for its images, you can yield another request back to parse with the link of the next page.
I didn't test the code.
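If you prefer to avoid a full framework, the same markup rule can also be expressed with Python's standard-library html.parser: only take an href when it appears inside div.wb_more right after the text "Original Resolution:". This is just a sketch (it does not handle nested divs, and uses the question's sample markup as hard-coded input):

```python
from html.parser import HTMLParser


class WbMoreLinkExtractor(HTMLParser):
    """Collect hrefs of <a> tags inside <div class="wb_more">
    that directly follow the text "Original Resolution:"."""

    def __init__(self):
        super().__init__()
        self.inside_wb_more = False
        self.grab_next_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'wb_more':
            self.inside_wb_more = True
        elif tag == 'a' and self.grab_next_link and 'href' in attrs:
            self.links.append(attrs['href'])
            self.grab_next_link = False

    def handle_data(self, data):
        if self.inside_wb_more and 'Original Resolution' in data:
            self.grab_next_link = True

    def handle_endtag(self, tag):
        if tag == 'div':  # naive: assumes wb_more divs are not nested
            self.inside_wb_more = False


html = '''
<div class="wb_more">
Original Resolution: <a href="//site.com/download/space_planet_sky_94434/4800x2700">4800x2700</a><br>
Views: <a href="/download/last">96661</a>
</div>
<div class="wd_resolution">
<span class="wd_res_cat_raz"><a class="wb_res_select" href="//site.com/download/space_planet_sky_94434/1600x1200">1600x1200</a></span>
</div>
'''

parser = WbMoreLinkExtractor()
parser.feed(html)
print(parser.links)  # ['//site.com/download/space_planet_sky_94434/4800x2700']
```

Note the "Views" link and the wd_resolution links are skipped, because only the text before the link decides.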
Upvotes: 2
Reputation: 402
You can try a plain wget and then run a regex over the downloaded page (with sed or perl, for example), and then download the links you obtain (wget can do that too).
A basic script will look like this:
wget [URL] -O MyPage.html
./GetTheFlag.pl -html MyPage.html > WallPaperList.txt
wget -i WallPaperList.txt # here you can save the images wherever you want
where GetTheFlag.pl looks something like:
use warnings;
use strict;
use Getopt::Long;

my $MyPage;
GetOptions("html=s" => \$MyPage);

open(my $fh, '<', $MyPage) or die "Cannot open $MyPage: $!";
my $content = do { local $/; <$fh> };  # slurp the whole file
close($fh);

my @urls = ($content =~ //gi);  # insert here a regex depending on the markup around the links
foreach my $i (@urls) {
    print "$i\n";
}
For example, if your links look like <a href="url">New Wallpaper</a>,
the regex will be
/<a href="(\w+)">New Wallpaper<\/a>/
But be careful with \w:
it misses characters that cannot appear in a variable name, such as - or /, so a negated character class like [^"]+ is safer for URLs.
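The difference is easy to demonstrate (shown in Python here for brevity; the same character classes apply in Perl). \w+ cannot even start matching a URL that begins with //, while [^"]+ captures everything up to the closing quote:

```python
import re

html = '<a href="//site.com/download/space_planet_sky_94434/4800x2700">New Wallpaper</a>'

# \w+ stops at any character outside [A-Za-z0-9_]; the leading '/' fails immediately
narrow = re.findall(r'<a href="(\w+)">New Wallpaper</a>', html)
print(narrow)  # []

# [^"]+ matches everything up to the closing quote, including '/' and '-'
wide = re.findall(r'<a href="([^"]+)">New Wallpaper</a>', html)
print(wide)  # ['//site.com/download/space_planet_sky_94434/4800x2700']
```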
Hope this is clear enough.
Upvotes: 1