Zia Lvinas

Reputation: 11

Simple Python Image Scraper Script

It's fairly simple stuff. I'm currently experimenting with Python and have very little experience. I want to create an image scraper that goes to a page, downloads the image, follows the "next page" link, downloads the next image, and so on (as a source I use a website similar to 9gag). Right now my script can only print the image URL and the next-link URL. I can't figure out how to make my bot follow the link, download the next image, and keep doing that until a condition is met or it is stopped.

PS: I'm using beautifulsoup4 (I think, LOL).

Thanks in advance, Zil

Here is what the script looks like now. I was combining a couple of scripts into one, so it looks very unclean:

import requests
from bs4 import BeautifulSoup
import urllib

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")

        for img in soup.findAll('img', {'class': 'img'}):
            temp = img.get('src')
            if temp[:1]=="/":
                image = "http://linksmiau.net" + temp
            else:
                image = temp

        print(image)


        for lnk in soup.findAll('div', {'id': 'arrow_right'}):
            nextlink = lnk.get('onclick')
            link = nextlink.replace("window.location = '", "")
            lastlink = "http://linksmiau.net" + link
            page += 1
        print(lastlink)
        url2 == lastlink

trade_spider(3)

Upvotes: 0

Views: 2664

Answers (2)

DougieHauser

Reputation: 470

I wouldn't think of it in terms of "clicking" a link, since you're writing a script, and not using a browser.

What you need is to figure out four things:

  1. Given a URL, how do you get the HTML behind it and parse it with BeautifulSoup? It sounds like you've got this part down already. :)

  2. Given many different HTML pages, how do you identify the images you want to download and the "next" link? Once again, BeautifulSoup.

  3. Given the URL of an image (found in the "src" attribute of <img> tags), how do you save the image to disk? Answers can be found in StackOverflow questions such as "Downloading a picture via urllib and python".

  4. Given the URL of a "next" link, how do you "click" on it? Once again, you're not really "clicking"; you just download the HTML from this new link and start the entire cycle again (parse it, identify the image and the "next" link, download the image, fetch the HTML behind the "next" link).
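
To make step 2 concrete, here is a rough sketch of pulling both URLs out of one page. It assumes the markup implied by the question's selectors (an <img class="img"> tag and a <div id="arrow_right"> whose onclick looks like window.location = '...'). The function name parse_page and the regular expression are my own illustration, not something the site or BeautifulSoup provides.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_page(html, page_url):
    """Return (image_url, next_url) for one page; either may be None."""
    soup = BeautifulSoup(html, "html.parser")

    # Selector from the question: <img class="img" src="...">
    img = soup.find("img", {"class": "img"})
    image_url = None
    if img and img.get("src"):
        # urljoin handles both relative ("/pics/1.jpg") and absolute src values
        image_url = urljoin(page_url, img["src"])

    # Selector from the question: <div id="arrow_right" onclick="...">
    arrow = soup.find("div", {"id": "arrow_right"})
    next_url = None
    if arrow and arrow.get("onclick"):
        # Assumed onclick format: window.location = '/some/path/'
        match = re.search(r"window\.location\s*=\s*'([^']+)'", arrow["onclick"])
        if match:
            next_url = urljoin(page_url, match.group(1))

    return image_url, next_url
```

Using a regular expression instead of str.replace also drops the closing quote that would otherwise be left on the end of the link.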

Once you've broken the problem down, all that's left is to assemble everything in one nice script, and you're done.
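
The assembly could look roughly like the sketch below. It only shows the loop itself: fetch and parse are placeholders you pass in (hypothetical helpers, for example a function that returns requests.get(url).text and one that returns the image URL and "next" URL for a page). The name follow_next_links is made up for this example.

```python
def follow_next_links(start_url, max_pages, fetch, parse):
    """Visit up to max_pages pages and return the image URLs found.

    fetch(url) must return the page HTML as text.
    parse(html, url) must return a tuple (image_url, next_url).
    Both are supplied by the caller.
    """
    url = start_url  # set once, before the loop, so it can change per page
    seen = set()     # pages already visited, to avoid going round in circles
    images = []

    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        image_url, next_url = parse(fetch(url), url)
        if image_url:
            images.append(image_url)
        # Assignment (=) moves on to the next page; == would only compare.
        url = next_url

    return images
```

The loop stops when the page limit is reached, when a page has no "next" link, or when the "next" link points back to a page that was already visited.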

Good luck :)

Upvotes: 1

Zia Lvinas

Reputation: 11

It's fixed. DougieHauser was right and I want to shake his hand for that.

I just moved the url2 line outside of the while loop and it seems to work just fine. Now all I need is to figure out how to make this script save pictures to my HDD, LOL.

def trade_spider(max_pages):
    url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
    page = 1
    while page <= max_pages:
        #url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        #current_bet_id = "event_odd_id_31362885" #+ str(5)

        #for link in soup.findAll('span', {'class': 'game'}, itemprop="name"):
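
For the remaining part (saving pictures to disk), one possible approach is sketched below using only the standard library. The helper names filename_from_url and save_image, the "images" folder, and the "image.jpg" fallback name are assumptions for illustration, not part of the script above.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(image_url, default="image.jpg"):
    """Use the last path segment of the URL as the file name."""
    name = os.path.basename(urlparse(image_url).path)
    return name or default  # fall back when the URL path ends in "/"


def save_image(image_url, folder="images"):
    """Download image_url into folder and return the path that was written."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename_from_url(image_url))
    # Copy the response body to disk in binary mode, without decoding it.
    with urllib.request.urlopen(image_url, timeout=10) as response, \
            open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

Calling save_image(image) right after the image URL is printed inside the loop would write each picture into the images folder. Note that two pictures with the same file name would overwrite each other in this sketch.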

Upvotes: 1
