Piyush Ghasiya

Reputation: 525

requests.exceptions.MissingSchema: Invalid URL

I am trying to scrape a webpage to get articles, but the links don't include http:, so I am getting a requests.exceptions.MissingSchema: Invalid URL error.

I know that I have to try something like 'http:' + href, but I can't figure out where to put it.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://mainichi.jp/english/search?q=cybersecurity&t=kiji&s=match&p={}'

pages = 6

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".list-typeD li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text,"lxml")
        date = sauce.select(".post p")
        date = date[0].text
        title = sauce.select_one(".header-box h1").text
        content = [elem.text for elem in sauce.select(".main-text p")]
        print(f'{date}\n {title}\n {content}\n')

        time.sleep(3)

I want to get the date, title, and content of all the articles from all pages.

Upvotes: 0

Views: 471

Answers (1)

nimishxotwod

Reputation: 335

This is because in the statement

resp = requests.get(item.get("href"))

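you are not sending a request to a valid URL. The href attribute probably contains a relative or protocol-relative URL (one without the leading https:) rather than an absolute URL, so requests cannot tell which scheme to use. Try prepending the missing part to item.get("href") before making the request.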

This should do:

resp = requests.get("https:"+item.get("href"))
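If the links on the page are not all of the same shape, plain string concatenation can break on some of them. A possibly more robust alternative, offered here as a sketch and not as part of the original answer, is urljoin from the standard library's urllib.parse, which resolves an href against the URL of the page it came from. The href values below are hypothetical examples, not links taken from the site:

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from (the search URL from the question, page 1).
base_url = "https://mainichi.jp/english/search?q=cybersecurity&t=kiji&s=match&p=1"

# Hypothetical href values, for illustration only.
hrefs = [
    "//mainichi.jp/english/articles/example",        # protocol-relative (no scheme)
    "/english/articles/example",                     # relative to the site root
    "https://mainichi.jp/english/articles/example",  # already absolute
]

# urljoin fills in whatever the href is missing (scheme, host) from base_url
# and leaves an already absolute URL unchanged.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for absolute_url in absolute_urls:
    print(absolute_url)
```

In the loop from the question, this would amount to resp = requests.get(urljoin(res.url, item.get("href"))), assuming res is the response for the search page.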

Upvotes: 2
