Reputation: 13
I have been learning to code for the last couple of months and, even though I've been pretty successful so far, I am stuck on my latest script.
I just learned how to scrape websites for basic content and wrote a script that saves an image from a page and then moves on to the next image in a loop. However, after a few iterations the script crashes with this error message:
2020-06-03 19:41:03,243 - DEBUG- #######Res sURL=https://xkcd.com/2277/
2020-06-03 19:41:03,245 - DEBUG- Starting new HTTPS connection (1): xkcd.com:443
2020-06-03 19:41:03,781 - DEBUG- https://xkcd.com:443 "GET /2276/ HTTP/1.1" 200 2607
Traceback (most recent call last):
  File "C:/Users/ivanx/PycharmProjects/strong passworddetection/xkcd.py", line 15, in <module>
    comicURL = 'https:'+ elms[0].get('src')
IndexError: list index out of range
I have no idea what is wrong. Any advice?
The whole script is as follows:
import requests, bs4, logging, os
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s- %(message)s')
os.makedirs('xkcd', exist_ok=True)
sURL = 'https://xkcd.com/'
logging.debug('Start of the while loop')
while not sURL.endswith('#'):
    site = requests.get(sURL)
    soup = bs4.BeautifulSoup(site.text, 'html.parser')
    elms = soup.select('#comic > img')
    comicURL = 'https:'+ elms[0].get('src')
    res = requests.get(comicURL)
    res.raise_for_status()
    imgFile = open(os.path.join('xkcd', os.path.basename(comicURL)), 'wb')
    for chunk in res.iter_content(100000):
        imgFile.write(chunk)
    imgFile.close()
    logging.debug('#######Res sURL=' + str(sURL))
    nElms = soup.select('#middleContainer > ul:nth-child(4) > li:nth-child(2) > a')
    pURL = 'https://xkcd.com'+nElms[0].get('href')
    sURL = pURL
Also, I forgot to add that if I try to start the loop on image 2276, like this:
sURL = 'https://xkcd.com/2276'
I get this error:
2020-06-03 19:53:26,245 - DEBUG- Start of the while loop
2020-06-03 19:53:26,248 - DEBUG- Starting new HTTPS connection (1): xkcd.com:443
2020-06-03 19:53:26,766 - DEBUG- https://xkcd.com:443 "GET /2276 HTTP/1.1" 301 178
2020-06-03 19:53:26,790 - DEBUG- https://xkcd.com:443 "GET /2276/ HTTP/1.1" 200 2607
Traceback (most recent call last):
  File "C:/Users/ivanx/PycharmProjects/strong passworddetection/xkcd.py", line 15, in <module>
    comicURL = 'https:'+ elms[0].get('src')
IndexError: list index out of range
Upvotes: 0
Views: 213
Reputation: 195468
I ran your scraper for a while and found a few bugs. I left comments in the code below. Note that some pages have no image at all, so you need to check for that too:
import requests, bs4, logging, os
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s- %(message)s')
os.makedirs('xkcd', exist_ok=True)
sURL = 'https://xkcd.com/'
logging.debug('Start of the while loop')
while not sURL.endswith('#'):
    logging.debug('#######Res sURL=' + str(sURL))
    site = requests.get(sURL)
    soup = bs4.BeautifulSoup(site.text, 'html.parser')
    img = soup.select_one('#comic img')  # <--- removed the > from the selector
    if img:  # <--- sometimes there's only text, no <IMG>
        comicURL = 'https:'+ img['src']
        res = requests.get(comicURL)
        res.raise_for_status()
        imgFile = open(os.path.join('xkcd', os.path.basename(comicURL)), 'wb')
        for chunk in res.iter_content(100000):
            imgFile.write(chunk)
        imgFile.close()
    else:
        logging.debug('### IMAGE NOT FOUND ###')
    pURL = 'https://xkcd.com' + soup.select_one('.comicNav a[rel="prev"]')['href']  # <-- changed selector for the previous link
    sURL = pURL
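To see the missing-image check in isolation, here is a minimal offline sketch. The two HTML snippets and the helper name parse_page are made up for illustration (the real xkcd markup is larger), but select_one does return None when nothing matches, which is what the guard relies on:

```python
import bs4

# Hypothetical, trimmed-down page snippets used only for this illustration.
PAGE_WITH_IMAGE = """
<ul class="comicNav"><li><a rel="prev" href="/2275/">Prev</a></li></ul>
<div id="comic"><a href="#"><img src="//imgs.xkcd.com/comics/self_isolate.png"/></a></div>
"""
PAGE_TEXT_ONLY = """
<ul class="comicNav"><li><a rel="prev" href="/100/">Prev</a></li></ul>
<div id="comic"><p>No plain image on this page.</p></div>
"""


def parse_page(html):
    """Return (image_url_or_None, previous_page_href_or_None) for one page."""
    soup = bs4.BeautifulSoup(html, 'html.parser')

    # select_one returns None instead of raising when nothing matches.
    img = soup.select_one('#comic img')
    image_url = ('https:' + img['src']) if img is not None else None

    prev_link = soup.select_one('.comicNav a[rel="prev"]')
    prev_href = prev_link['href'] if prev_link is not None else None
    return image_url, prev_href


print(parse_page(PAGE_WITH_IMAGE))
print(parse_page(PAGE_TEXT_ONLY))
```

Checking the previous link for None as well is an extra precaution in this sketch; the loop above assumes that link is always present.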
Upvotes: 0
Reputation: 23171
You're getting that error because your selector call returns an empty list, so elms[0] does not exist.
On the page https://xkcd.com/2276, you can see that the 'comic' div looks like this:
<div id="comic">
<a href="https://twitter.com/kakape/status/1235319133585248259"><img src="//imgs.xkcd.com/comics/self_isolate.png" title="Turns out I've been &quot;practicing social distancing&quot; for years without even realizing it was a thing!" alt="Self-Isolate" srcset="//imgs.xkcd.com/comics/self_isolate_2x.png 2x"/></a>
</div>
Because the img is wrapped in an anchor tag, the anchor is the direct child of the div and the image is not, so the child combinator in '#comic > img' matches nothing.
To select the image element for this structure, use this selector:
elms = soup.select('#comic img')
This will select any img contained in the comic div.
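If you want to convince yourself of the difference without hitting the site, a small sketch like this compares the two selectors on an inline copy of that structure. The HTML string is a shortened stand-in for the real page, and the anchor URL is a placeholder:

```python
import bs4

# Shortened stand-in for the comic div on that page (placeholder link URL).
HTML = """
<div id="comic">
<a href="https://example.com/"><img src="//imgs.xkcd.com/comics/self_isolate.png" alt="Self-Isolate"/></a>
</div>
"""

soup = bs4.BeautifulSoup(HTML, 'html.parser')

# Child combinator: only matches an img that is a direct child of #comic.
child_matches = soup.select('#comic > img')

# Descendant combinator: matches an img at any depth inside #comic.
descendant_matches = soup.select('#comic img')

print(len(child_matches))       # the img sits inside the <a>, so no match
print(len(descendant_matches))  # the nested img is found
print(descendant_matches[0].get('src'))
```

Even with the corrected selector, it is probably worth checking that the list is non-empty before indexing it, in case a page has no image.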
Upvotes: 2