Talal Wasim

Reputation: 33

python urllib returns empty page for specific urls

I am having trouble with specific links with urllib. Below is the code sample I use:

from urllib.request import Request, urlopen
import re

url = ""
req = Request(url)
html_page = urlopen(req).read()

print(len(html_page))

Here are the results I get for two links:

url = "https://www.dafont.com"
Length: 0

url = "https://www.stackoverflow.com"
Length: 196673

Anyone got any idea why this happens?

Upvotes: 1

Views: 634

Answers (3)

lucy3_li

Reputation: 1

I found that using selenium helped me fix a similar issue, where webpages returned by urlopen were empty.

For a given url, I use:

from selenium import webdriver

# Configure a headless Chromium instance
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--no-sandbox")
options.add_argument("--headless")
options.binary_location = "/usr/bin/chromium-browser"  # path to the Chromium binary
driver = webdriver.Chrome(options=options)

# Load the page and grab the fully rendered HTML
driver.get(url)
html = driver.page_source

Note that you will need to install a Chromium browser for this to work, and edit binary_location to point to where it is installed on your machine.
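For completeness, a minimal follow-up sketch (assuming the url variable from the question) that prints the rendered page length and shuts the browser down:

print(len(html))  # should now be non-zero for dafont.com
driver.quit()     # close the headless browser when finished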

Upvotes: 0

Sanket Vyawahare

Reputation: 130

Try the code below and you will get the response. Certain websites are secured and respond only to specific user agents.

from urllib.request import Request, urlopen

url = "https://www.dafont.com"
# Send a browser-like User-Agent so the site does not reject the request as a crawler
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = Request(url, headers=headers)
html_page = urlopen(req).read()

print(len(html_page))

Upvotes: 2

serghei

Reputation: 3381

This is a limitation imposed by the authors of the dafont website.

By default, urllib sends a User-Agent header of urllib/VVV, where VVV is the urllib version number (see https://docs.python.org/3/library/urllib.request.html). Many webmasters protect their sites from crawlers by parsing the User-Agent header, so when they come across a value like urllib/VVV they treat the request as coming from a crawler.
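If you want to keep using plain urlopen calls, one option (a minimal sketch, not taken from the answer) is to change the default User-Agent for all requests by installing a custom opener; the Mozilla/5.0 value here is just an illustrative browser-like string:

from urllib.request import build_opener, install_opener, urlopen

# Replace urllib's default "Python-urllib/x.y" User-Agent globally
opener = build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
install_opener(opener)

html_page = urlopen("https://www.dafont.com").read()
print(len(html_page))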

Testing HEAD method:

$ curl -A "Python-urllib/2.6" -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:11:53 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Type: text/html

$ curl -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:12:02 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Set-Cookie: PHPSESSID=dcauh0dp1antb7eps1smfg2a76; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html

Testing GET method:

$ curl -sSL -A "Python-urllib/2.6" https://www.dafont.com | wc -c
       0

$ curl -sSL https://www.dafont.com | wc -c
   18543
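The same comparison can be reproduced in Python (an illustrative sketch reusing the browser-like header shown earlier in this thread):

from urllib.request import Request, urlopen

url = "https://www.dafont.com"

# Default request: sent with urllib's own User-Agent
default_len = len(urlopen(Request(url)).read())

# Browser-like request: sent with a spoofed User-Agent
browser_len = len(urlopen(Request(url, headers={"User-Agent": "Mozilla/5.0"})).read())

print(default_len, browser_len)  # expected: 0 and a non-zero length, matching the curl tests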

Upvotes: 2
