Reputation: 33
I am having trouble with specific links with urllib. Below is the code sample I use:
from urllib.request import Request, urlopen
url = ""  # set to the page to fetch, e.g. one of the two URLs below
req = Request(url)
html_page = urlopen(req).read()
print(len(html_page))
Here are the results I get for two links:
url = "https://www.dafont.com"
Length: 0
url = "https://www.stackoverflow.com"
Length: 196673
Anyone got any idea why this happens?
Upvotes: 1
Views: 634
Reputation: 1
I found that using selenium helped me fix a similar issue, where webpages returned by urlopen were empty.
For a given url, I use:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
options.add_argument('--headless')
options.binary_location = "/usr/bin/chromium-browser"
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
Note that you will need to install a Chromium browser for this to work, and set options.binary_location to the path where it is installed on your machine.
Upvotes: 0
Reputation: 130
Try sending a browser-like User-Agent header with the request, and you will get a response. Some websites only respond to certain user agents.
from urllib.request import Request, urlopen
url = "https://www.dafont.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = Request(url, headers=headers)
html_page = urlopen(req).read()
print(len(html_page))
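If several pages need the same header, one option is to set it once on an opener instead of on every Request. The following is a minimal sketch using only the standard library; the names BROWSER_UA and make_browser_opener are made up for illustration, and the User-Agent string is simply the one from the snippet above.

```python
from urllib.request import build_opener, install_opener

# Assumed browser-like User-Agent string, copied from the snippet above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)

def make_browser_opener(user_agent=BROWSER_UA):
    """Return an opener whose default User-Agent header is user_agent."""
    opener = build_opener()
    # addheaders is a list of (name, value) tuples applied to every request
    # that does not already carry a header with the same name.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener

opener = make_browser_opener()
install_opener(opener)  # later plain urlopen() calls go through this opener
```

After install_opener, a call such as urlopen("https://www.dafont.com") should send the browser-like header without building a Request by hand.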
Upvotes: 2
Reputation: 3381
This is a limitation imposed by the authors of the dafont website.
By default, urllib sends a User-Agent header of Python-urllib/VVV, where VVV is the Python version number. For more see: https://docs.python.org/3/library/urllib.request.html
Many webmasters protect their sites from crawlers by parsing the User-Agent header. So when they come across a User-Agent header like Python-urllib/VVV, they treat the request as coming from a crawler.
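To check the exact default string on your own interpreter, you can inspect a fresh opener. This is a small sketch; the helper name default_user_agent is made up for illustration.

```python
from urllib.request import build_opener

def default_user_agent():
    """Return the User-Agent string urllib sends when none is set."""
    opener = build_opener()
    # A fresh OpenerDirector starts with a single ('User-agent', ...) entry.
    headers = dict(opener.addheaders)
    return headers["User-agent"]

# The value depends on the running interpreter's version.
print(default_user_agent())
```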
Testing HEAD method:
$ curl -A "Python-urllib/2.6" -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:11:53 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Type: text/html
$ curl -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:12:02 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Set-Cookie: PHPSESSID=dcauh0dp1antb7eps1smfg2a76; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html
Testing GET method:
$ curl -sSL -A "Python-urllib/2.6" https://www.dafont.com | wc -c
0
$ curl -sSL https://www.dafont.com | wc -c
18543
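The same comparison can be reproduced from Python without touching the real site. The sketch below starts a local server that returns an empty body to urllib's default User-Agent; PickyHandler is a hypothetical stand-in for this kind of filtering, not dafont's actual logic, and body_length is an illustrative helper.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener

class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical server: empty body for urllib's default User-Agent."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        body = b"" if agent.startswith("Python-urllib/") else b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet

# An empty ProxyHandler bypasses any proxy configured in the environment,
# so the request reaches the loopback server directly.
opener = build_opener(ProxyHandler({}))

def body_length(url, user_agent=None):
    """Fetch url and return the body size, optionally overriding User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    with opener.open(Request(url, headers=headers)) as response:
        return len(response.read())

# Port 0 asks the OS for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

default_len = body_length(url)                 # sent as Python-urllib/...
browser_len = body_length(url, "Mozilla/5.0")  # sent with a browser-like agent
print(default_len, browser_len)

server.shutdown()
server.server_close()
```

Both requests get a 200 status, as in the curl output above; only the body size differs, which is why checking the status code alone does not reveal the problem.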
Upvotes: 2