Shravan Yadav

Reputation: 1317

Unable to read html page from beautiful soup

The code below gets stuck after printing hi. Can you please check what is wrong with it? Is the site secured in some way, so that I need special authentication?

from bs4 import BeautifulSoup
import requests

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
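
One way to confirm that the request is hanging rather than failing outright is to pass a timeout, so requests raises an exception instead of waiting forever. This is only a rough diagnostic sketch, not part of the original code, and it assumes the hang happens on the server side:

from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'

try:
    # A short timeout makes a hung connection visible instead of blocking forever.
    r = requests.get(rooturl, timeout=10)
    print('status:', r.status_code)
    soup = BeautifulSoup(r.content, "html.parser")
except requests.exceptions.Timeout:
    print('The server accepted the connection but never responded (possibly bot blocking).')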

Upvotes: 2

Views: 957

Answers (2)

KC.

Reputation: 3107

You get this problem because the website thinks you are a robot, so it won't send anything back; it may even leave the connection hanging so that you wait forever.

If you imitate a browser's request, the server will no longer treat you as a robot.

Adding headers is the simplest way to deal with this, but sometimes a User-Agent alone is not enough (as in this case). Copy your browser's request and remove the unnecessary headers through testing. If you are lazy, you can use your browser's headers as they are, but do not copy all of them when you want to upload files.

from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    # Replace the default requests headers with ones copied from a real browser.
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)

print(resp.content)
soup = BeautifulSoup(resp.content, "html.parser")
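
If you want to find out which headers actually matter, you can roughly test subsets of the browser headers with a short timeout and keep only the ones the server seems to require. This is just a sketch of that idea, not something the server is guaranteed to behave consistently for:

import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en",
}

# Drop one header at a time; if the request still succeeds quickly,
# that header is probably not the one the server checks.
for name in list(browser_headers):
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    try:
        resp = requests.get(rooturl, headers=trimmed, timeout=10)
        print(f"without {name}: HTTP {resp.status_code}")
    except requests.exceptions.Timeout:
        print(f"without {name}: timed out (header probably required)")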

Upvotes: 4

chitown88

Reputation: 28565

I was having the same issue as you; the request just sat there. I tried adding a user-agent header, and it pulled the page relatively quickly. I don't know why that is, though.

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl, headers=headers)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)

EDIT: So odd. Now it's not working for me again: it first didn't work, then it did, and now it doesn't. But there is another potential option using Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser fetch the page, then parse the rendered HTML.
browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')

r = browser.page_source
print('hi1')
soup = BeautifulSoup(r, "html.parser")
print('hi2')
print(soup)

browser.close()
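
If the Selenium route works for you, it can be made a bit more robust by running Chrome headless and waiting for the page to finish loading before grabbing page_source. A rough sketch of that, assuming a recent Selenium and a working chromedriver:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a browser window

browser = webdriver.Chrome(options=options)
try:
    browser.get('http://www.hoovers.com/company-information/company-search.html')

    # Wait (up to 15 s) until the document reports it has finished loading.
    WebDriverWait(browser, 15).until(
        lambda d: d.execute_script('return document.readyState') == 'complete'
    )

    soup = BeautifulSoup(browser.page_source, "html.parser")
    print(soup.title)
finally:
    browser.quit()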

Upvotes: 1
