Reputation: 51
I want to write a web scraper to collect article titles from Medium.com. I am using Python 3.7 and have imported urlopen from urllib.request, but the script cannot open the site and fails with "urllib.error.HTTPError: HTTP Error 403: Forbidden".
from bs4 import BeautifulSoup
from urllib.request import urlopen

webAdd = urlopen("https://medium.com/")
bsObj = BeautifulSoup(webAdd.read(), "html.parser")

Result: urllib.error.HTTPError: HTTP Error 403: Forbidden
I expected it to read the site without any error. But the error does not happen when I use the requests module instead:
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/'
response = requests.get(url, timeout=5)
This time it works without any error. Why?
Upvotes: 1
Views: 2521
Reputation: 1917
This worked for me:
from urllib.request import urlopen

html = urlopen(MY_URL)
contents = html.read()
print(contents)
Upvotes: 0
Reputation: 1079
Many sites nowadays check the User-Agent header of incoming requests to try and deter bots. requests is the better module to use, but if you really want to use urllib, you can alter the request headers to pretend to be Firefox or something else, so that the request is not blocked. A quick example can be found here: https://stackoverflow.com/a/16187955
import urllib.request
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = "http://example.com"
request = urllib.request.Request(url)
request.add_header('User-Agent', user_agent)
response = urllib.request.urlopen(request)
You will also need to fill in the user_agent string with appropriate platform and version values (the fields above are placeholders). Hope this helps.
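For comparison, the same spoofed-header trick works with requests, which this answer notes is the better module; the sketch below is an illustration only, and the URL and User-Agent string are placeholders you would substitute yourself:

```python
import requests

# Placeholder values -- swap in your target URL and a real browser User-Agent.
user_agent = 'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'
url = 'http://example.com'

# requests lets you pass extra headers directly as a dict.
response = requests.get(url, headers={'User-Agent': user_agent}, timeout=5)
print(response.status_code)   # 200 if the server accepted the request
print(response.text[:200])    # first part of the returned HTML
```

Note that requests sends its own default User-Agent (python-requests/x.y) when you don't override it, which some sites accept and others block.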
Upvotes: 3
Reputation: 1193
urllib is a fairly old, low-level module. For web scraping, the requests module is recommended.
You can check out this answer for additional information.
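To tie this back to the original goal of collecting article titles, here is a minimal sketch using requests with BeautifulSoup. The h2 selector is an assumption, not Medium's actual markup; inspect the page to find the element that really holds the titles:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://medium.com/'
response = requests.get(url, timeout=5)
response.raise_for_status()   # raises HTTPError on 403, 404, etc.

soup = BeautifulSoup(response.text, 'html.parser')

# Assumption: headlines live in <h2> tags. Adjust after inspecting the page.
for heading in soup.find_all('h2'):
    print(heading.get_text(strip=True))
```

Using raise_for_status() surfaces the same 403 condition explicitly instead of silently parsing an error page.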
Upvotes: 4