Nobody-InHere

Reputation: 1

MissingSchema(error) thrown when following tutorial

I am following "The Complete Python Course: Beginner to Advanced!" on Skillshare, and at one point my code breaks while the code in the tutorial keeps working.

The tutorial is about making a web scraper with BeautifulSoup, Pillow, and io. I'm supposed to be able to search for anything on Bing and then save the pictures from the image search results to a folder on my computer.

Here's the code:

from bs4 import BeautifulSoup
import requests
from PIL import Image
from io import BytesIO

search = input("Search for:")
params = {"q": search}
r = requests.get("http://bing.com/images/search", params=params)

soup = BeautifulSoup(r.text, "html.parser")
links = soup.findAll("a", {"class": "iusc"})

for item in links:
    img_obj = requests.get(item.attrs["href"])
    print("getting", item.attrs["href"])
    title = item.attrs["href"].split("/")[-1]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images" + title, img.format)

Whenever I run it, it fails at the end with raise MissingSchema(error) and the message requests.exceptions.MissingSchema: Invalid URL

I tried adding

img_obj = requests.get("https://" + item.attrs["href"])

but it keeps giving me the same error. I looked at Bing's page source, and the only change I made was replacing the "thumb" class with "iusc". When I use the "thumb" class as in the tutorial, the program runs without saving anything and eventually just finishes.

Thank you for your help

EDIT: Here is the whole error that is being thrown, as requested by baileythegreen:

Traceback (most recent call last):
  File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 14, in <module>
    img_obj = requests.get(item.attrs["href"])
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 515, in request
    prep = self.prepare_request(req)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\sessions.py", line 443, in prepare_request
    p.prepare(
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 318, in prepare
    self.prepare_url(url, params)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\models.py", line 392, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '/images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0': No scheme supplied. Perhaps you meant http:///images/search?view=detailV2&ccid=mhMFjL9x&id=AE886A498BB66C1DCDCC08B6B45163C71DBF18CB&thid=OIP.mhMFjL9xzdgqujACTRW4zAHaNL&mediaurl=https%3a%2f%2fimage.zmenu.com%2fmenupic%2f2349041%2fs_6565a805-53ac-4f35-a2cb-a3f79c3eab4b.jpg&cdnurl=https%3a%2f%2fth.bing.com%2fth%2fid%2fR.9a13058cbf71cdd82aba30024d15b8cc%3frik%3dyxi%252fHcdjUbS2CA%26pid%3dImgRaw%26r%3d0&exph=1000&expw=562&q=pizza&simid=607993487650659823&FORM=IRPRST&ck=B86DF0449AD7ABD39A1B1697EA9E6D16&selectedIndex=0?

Edit 2: I followed hawschiat's instructions, and I am getting a different error this time:

Traceback (most recent call last):
  File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 15, in <module>
    print("getting", item.attrs["href"])
KeyError: 'href'

However, if I keep the "src" attribute in the print line, I get

getting http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7
Traceback (most recent call last):
  File "C:\Users\user\PycharmProjects\webscrapery\images.py", line 18, in <module>
    img.save(r'C://Users/user/PycharmProjects/webscrapery/scraped_images' + title, img.format)
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\site-packages\PIL\Image.py", line 2209, in save
    fp = builtins.open(filename, "w+b")
OSError: [Errno 22] Invalid argument: 'C://Users/user/PycharmProjects/webscrapery/scraped_imageshttp://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7'

I tried putting the 'r' prefix in front of the C: path, but I keep getting the same error. I also tried changing the forward slashes to backslashes, and putting two slashes in front of the C. I also made sure I have write permission on the scraped_images folder, as well as on webscrapery.

Upvotes: 0

Views: 395

Answers (1)

hawschiat

Reputation: 346

The last line of your stack trace hints at the cause of the error: the URL scraped from the web page is not a full URL, only the path to the resource.

To make it a full URL, prepend the scheme and authority. In your case, that would be https://bing.com.

That said, I don't think the URL you obtained is actually the URL of the image. Inspecting the Bing Images page with Chrome's developer tools, we can see that the structure of the page looks something like this:
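As a minimal sketch of that step, the standard library's urljoin can resolve the scraped path against a base URL. The base used here and the shortened href are assumptions for illustration; the snippet also shows why prepending only "https://" (as tried in the question) still leaves the URL without a host:

```python
from urllib.parse import urljoin, urlparse

# Assumption: the results page was served from this origin.
BASE_URL = "https://bing.com"

# Shortened, illustrative form of the relative href shown in the traceback.
href = "/images/search?view=detailV2&q=pizza&selectedIndex=0"

# Prepending only the scheme yields "https:///images/...", which has no host.
scheme_only = "https://" + href
print("host after prepending scheme only:", repr(urlparse(scheme_only).netloc))

# urljoin resolves the path against the base, supplying both scheme and host.
full_url = urljoin(BASE_URL, href)
print("full URL:", full_url)
```

Using urljoin instead of string concatenation also leaves hrefs that are already absolute untouched.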

[Image: HTML structure of Bing Image]

Notice that the anchor (a) element points to the preview page, while its child img element contains the actual path to the resource.

With that in mind, we can rewrite your code to something like:

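# The img elements with class "mimg" hold the image URL in their src attribute
links = soup.find_all("img", {"class": "mimg"})

for item in links:
    src = item.attrs["src"]
    print("getting", src)
    img_obj = requests.get(src)
    # Use the last path segment as the file name and drop the query string,
    # since "?" is not allowed in Windows file names
    title = src.split("/")[-1].split("?")[0]
    img = Image.open(BytesIO(img_obj.content))
    img.save("C:\\Users\\user\\PycharmProjects\\webscrapery\\scraped_images\\" + title, img.format)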

And this should achieve what you are trying to do.
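The OSError in Edit 2 of the question appears to come from the file name rather than the folder: the whole image URL, including ":" and "?", ends up in the path passed to img.save. Below is a rough sketch of deriving a Windows-safe file name from the src URL. The helper name safe_filename and the relative scraped_images folder are hypothetical, and the extension is taken from the image format string (such as the value of Pillow's img.format):

```python
import os
import re
from urllib.parse import urlparse


def safe_filename(url, image_format, fallback="image"):
    """Build a file name from the last path segment of url (hypothetical helper)."""
    # Take only the path, so the query string ("?w=187&h=333...") is dropped.
    stem = urlparse(url).path.rsplit("/", 1)[-1]
    # Replace characters that Windows does not allow in file names.
    stem = re.sub(r'[<>:"/\\|?*]', "_", stem)
    if not stem:
        stem = fallback
    # Use the image format (e.g. "JPEG") as the extension.
    return stem + "." + image_format.lower()


# The src URL printed in Edit 2 of the question.
src = "http://tse2.mm.bing.net/th/id/OIP.mhMFjL9xzdgqujACTRW4zAHaNL?w=187&h=333&c=7&r=0&o=5&pid=1.7"

# Assumption: a folder named scraped_images relative to the script.
target = os.path.join("scraped_images", safe_filename(src, "JPEG"))
print(target)
```

In the loop above, the result could be passed to img.save in place of the concatenated string; os.path.join also supplies the separator between the folder and the file name.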

Upvotes: 1
