Reputation: 1
I'm attempting to scrape the website at https://www.cbit.ac.in/current_students/acedamic-calendar/ using the requests library along with BeautifulSoup. However, upon making a request to the website, I encounter the following SSL certificate verification error:
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.cbit.ac.in', port=443): Max retries exceeded with url: /current_students/acedamic-calendar/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))
To address the SSL verification issue, I've attempted to specify the path to the CA certificate using the verify parameter in the requests.get() call. The CA certificate path is /Users/rishilboddula/Downloads/cbit.ac.in.cer. Despite this, the SSL verification error persists.
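For reference, here's a quick standard-library check I can run (just a sketch; the helper name is my own) to confirm whether the downloaded .cer file is actually PEM-encoded, since the verify parameter of requests expects a PEM file and a DER-encoded .cer export would fail in exactly this way:

```python
def looks_like_pem(path):
    """Return True if the file starts with the PEM certificate header.

    requests' verify parameter expects a PEM-encoded file; a DER-encoded
    .cer export (binary, no '-----BEGIN CERTIFICATE-----' header) will
    not be usable as a CA bundle.
    """
    with open(path, 'rb') as f:
        return f.read().lstrip().startswith(b'-----BEGIN CERTIFICATE-----')
```

If looks_like_pem('/Users/rishilboddula/Downloads/cbit.ac.in.cer') returns False, that alone would explain why passing the file to verify doesn't help.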
After successfully scraping the website, I intend to store the extracted URLs in a MongoDB collection named ull using the pymongo library. However, due to the SSL verification error, I'm unable to proceed with the scraping and data insertion process.
I'm seeking guidance on resolving the SSL certificate verification error to successfully scrape the website and insert the data into MongoDB. Additionally, if there are any best practices or alternative approaches for handling SSL certificate verification in Python, I would greatly appreciate any insights.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import pymongo
# Specify the path to the CA certificate
ca_cert_path = '/Users/rishilboddula/Downloads/cbit.ac.in.cer'
# Make a request to the website with SSL verification
req = requests.get('https://www.cbit.ac.in/current_students/acedamic-calendar/', verify=ca_cert_path)
# Parse the HTML content
soup = BeautifulSoup(req.content, 'html.parser')
# Extract all URLs from the webpage
links = soup.find_all('a')
urls = [link.get('href') for link in links]
# Connect to MongoDB
client = pymongo.MongoClient('mongodb://localhost:27017')
db = client["data"]
ull = db["ull"]
# Insert each URL into the MongoDB collection
for url in urls:
    ull.insert_one({"url": url})
Upvotes: 0
Views: 108
Reputation: 1
Check CA Certificate Path: Make sure ca_cert_path is correct and accessible. Verify the file location and permissions.
Update CA Certificates: SSL errors may occur due to outdated CA certificates. Update them on your system.
Use System Certificates: Instead of specifying a custom CA path, set verify=True (the default) in requests.get(); requests then verifies against the up-to-date CA bundle shipped with the certifi package rather than a single downloaded certificate.
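If the downloaded .cer file turns out to be DER-encoded (a common format for certificate exports), it can be converted to PEM with the standard library before being passed to verify. A sketch, with a helper name of my own and the paths from the question assumed:

```python
import ssl

def der_to_pem(der_path, pem_path):
    """Convert a DER-encoded certificate file to PEM, which requests accepts."""
    with open(der_path, 'rb') as f:
        der_bytes = f.read()
    pem_text = ssl.DER_cert_to_PEM_cert(der_bytes)
    with open(pem_path, 'w') as f:
        f.write(pem_text)
    return pem_text

# e.g. der_to_pem('/Users/rishilboddula/Downloads/cbit.ac.in.cer',
#                 '/Users/rishilboddula/Downloads/cbit.ac.in.pem')
```

Note that even a correctly converted leaf certificate may not be enough: "unable to get local issuer certificate" usually means the issuing chain is missing, so the full certificate chain (or simply the default certifi bundle) may be what's actually needed.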
Approach (Use System Certificates):
import requests
from bs4 import BeautifulSoup
import pymongo
# Make a request to the website with SSL verification using system certificates
req = requests.get('https://www.cbit.ac.in/current_students/acedamic-calendar/', verify=True)
# Parse the HTML content
soup = BeautifulSoup(req.content, 'html.parser')
# Extract all URLs from the webpage
links = soup.find_all('a')
urls = [link.get('href') for link in links]
# Connect to MongoDB
client = pymongo.MongoClient('mongodb://localhost:27017')
db = client["data"]
ull = db["ull"]
# Insert each URL into the MongoDB collection
for url in urls:
    ull.insert_one({"url": url})
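One small hardening worth considering on top of the code above: link.get('href') returns None for anchor tags without an href attribute, and inserting those pollutes the collection. A sketch (the helper name is my own) that filters first and prepares documents suitable for a single insert_many call:

```python
def urls_to_documents(hrefs):
    """Drop missing/empty hrefs and wrap the rest as MongoDB documents."""
    return [{"url": u} for u in hrefs if u]

# docs = urls_to_documents(urls)
# if docs:
#     ull.insert_many(docs)  # one round trip instead of one insert per URL
```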
Upvotes: -1