Reputation: 11
It's fairly simple stuff here. I'm currently experimenting with Python and have very little experience. I want to create an image scraper that goes to a page, downloads the image, follows the link to the next page, downloads the image there, and so on (as the source I use a website similar to 9gag). Right now my script can only print the image URL and the next link's URL. I can't figure out how to make my bot follow the link, download the next image, and keep going indefinitely (until some condition is met, it is stopped, etc.).
PS: I'm using beautifulsoup4 (I think, LOL).
Thanks in advance, Zil
Here is what the script looks like now. I was combining a couple of scripts into one, so it looks very unclean.
import requests
from bs4 import BeautifulSoup
import urllib


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # Print the URL of every image on the page, making relative paths absolute.
        for img in soup.findAll('img', {'class': 'img'}):
            temp = img.get('src')
            if temp[:1] == "/":
                image = "http://linksmiau.net" + temp
            else:
                image = temp
            print(image)
        # The "next" arrow stores its target in an onclick handler.
        for lnk in soup.findAll('div', {'id': 'arrow_right'}):
            nextlink = lnk.get('onclick')
            link = nextlink.replace("window.location = '", "")
            lastlink = "http://linksmiau.net" + link
        page += 1
        print(lastlink)
        url2 = lastlink


trade_spider(3)
Upvotes: 0
Views: 2664
Reputation: 470
I wouldn't think of it in terms of "clicking" a link, since you're writing a script, and not using a browser.
What you need is to figure out 4 things:
1. Given a URL, how do you get the HTML behind it and parse it with BeautifulSoup? It sounds like you've got this part down already. :)
2. Given many different HTML pages, how do you identify the images you want to download and the "next" link? Once again, BeautifulSoup.
3. Given the URL of an image (found in the "src" attribute of <img> tags), how do you save the image to disk? Answers can be found in StackOverflow questions like this one: Downloading a picture via urllib and python
4. Given the URL of a "next" link, how do you "click" on it? Once again, you're not really "clicking"; you just download the HTML from this new link and start the entire cycle again (parse it, identify the image and the "next" link, download the image, fetch the HTML behind the "next" link).
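For the "save the image to disk" step, here is a minimal sketch using only the standard library's urllib. The function name save_image and the default folder name "images" are made up for illustration, and the file name is simply taken from the last segment of the URL path, which may not suit every site.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit


def save_image(image_url, target_dir="images"):
    """Download image_url into target_dir and return the local file path."""
    os.makedirs(target_dir, exist_ok=True)
    # Use the last path segment of the URL as the file name (an assumption;
    # fall back to a fixed name if the URL path ends with a slash).
    filename = os.path.basename(urlsplit(image_url).path) or "image"
    local_path = os.path.join(target_dir, filename)
    # Copy the response body straight into the file in binary mode.
    with urllib.request.urlopen(image_url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return local_path
```

In the question's script, this could be called with the absolute image URL at the point where it is currently only printed. Note that two images with the same file name would overwrite each other in this sketch.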
Once you've broken the problem down, all that's left is to assemble everything in one nice script, and you're done.
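As a rough sketch of how the pieces might fit together: the selectors (img with class "img", div with id "arrow_right") and the shape of the onclick value are taken from the question's code, so they are assumptions about that particular site. The names next_url_from_onclick, parse_page and crawl are hypothetical, and fetch is whatever function you use to turn a URL into HTML text.

```python
import re
from urllib.parse import urljoin

# Assumed shape of the onclick attribute, based on the question's code:
#   window.location = '/some/relative/path/'
ONCLICK_TARGET = re.compile(r"window\.location\s*=\s*'([^']*)'")


def next_url_from_onclick(onclick, base_url):
    """Return the absolute URL named in an onclick handler, or None."""
    match = ONCLICK_TARGET.search(onclick or "")
    return urljoin(base_url, match.group(1)) if match else None


def parse_page(html, base_url):
    """Return (image_urls, next_url) for one page, using the question's selectors."""
    from bs4 import BeautifulSoup  # imported here; only this function needs bs4

    soup = BeautifulSoup(html, "html.parser")
    # urljoin makes relative src values absolute and leaves absolute ones alone.
    images = [urljoin(base_url, img["src"])
              for img in soup.find_all("img", {"class": "img"}) if img.get("src")]
    arrow = soup.find("div", {"id": "arrow_right"})
    next_url = next_url_from_onclick(arrow.get("onclick"), base_url) if arrow else None
    return images, next_url


def crawl(start_url, max_pages, fetch, parse=parse_page):
    """Follow "next" links from start_url for up to max_pages pages."""
    url, found, seen = start_url, [], set()
    for _ in range(max_pages):
        # Stop when there is no next link or the site loops back on itself.
        if url is None or url in seen:
            break
        seen.add(url)
        images, next_url = parse(fetch(url), url)
        found.extend(images)
        url = next_url  # the "click": the next iteration fetches this page
    return found
```

With requests, a call could look like `crawl(start_url, 3, fetch=lambda u: requests.get(u).text)`. The key point is that the current URL lives outside the per-page work and is reassigned at the end of each iteration, so each pass really does move on to the next page.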
Good luck :)
Upvotes: 1
Reputation: 11
It's fixed. DougieHauser was right, and I want to shake his hand for that.
I just moved the url2 line outside of the while loop and it seems to work just fine. Now all I need is to figure out how to make this script save pictures to my HDD, LOL.
def trade_spider(max_pages):
    url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
    page = 1
    while page <= max_pages:
        #url2 = 'http://linksmiau.net/linksmi_paveiksliukai/rimtas_rudeninis_ispejimas_merginoms/1819/'
        url = url2
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        #current_bet_id = "event_odd_id_31362885" #+ str(5)
        #for link in soup.findAll('span', {'class': 'game'}, itemprop="name"):
Upvotes: 1