PythonStarter

Reputation: 51

How to do scraping from a page with BeautifulSoup

The question is very simple, but it doesn't work for me and I don't know why!

I want to scrape the beer rating from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.

This is my code:

import requests
import bs4
from bs4 import BeautifulSoup



url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

test_html = requests.get(url).text

soup = BeautifulSoup(test_html, "lxml")

rating = soup.findAll("span", class_="ratingValue")

rating

When I run this, it doesn't work, although the same approach works on another page. Can someone help me? The rating I expect to get is 4.58.

Upvotes: 2

Views: 226

Answers (5)

Yash Shukla

Reputation: 141

    import requests
    from bs4 import BeautifulSoup

    # Send a browser-like User-Agent; without it the server answers 403 Forbidden
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
    }

    url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

    test_html = requests.get(url, headers=headers).text

    soup = BeautifulSoup(test_html, 'html5lib')

    # The rating is identified by its itemprop attribute, not by a class
    rating = soup.find('span', {'itemprop': 'ratingValue'})

    print(rating.text)

Upvotes: 0

raunak rathi

Reputation: 95

The page you are requesting responds with 403 Forbidden, so you may not see an error, but the result will be an empty list []. To avoid this, add a User-Agent header; the code below will get you the desired result.

import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers={'User-Agent':user_agent} 

request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")

rating = soup.find('span', {'itemprop':'ratingValue'})

rating.text
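
A side note on the 403: with urllib, unlike requests, a refused request raises an exception instead of quietly handing back an error page. Here is a minimal sketch of handling that explicitly. The helper names `build_request` and `fetch_html` are hypothetical (they are not part of the code above), and the User-Agent string is simply reused from this answer.

```python
import urllib.error
import urllib.request

# Assumption: any common browser User-Agent should do; this one is reused from the answer.
USER_AGENT = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) "
    "Gecko/2009021910 Firefox/3.0.7"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return (status, body); body is None when the server refuses the request."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses, so a 403 ends up here.
        return err.code, None
```

Calling `fetch_html(url)` with the recipe URL should then return either the page bytes or a status code you can inspect, rather than leaving you with an unexplained empty result.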

Upvotes: 0

ans2human

Reputation: 2357

You are getting a forbidden status code (HTTP 403), which means the server understood your request but refuses to fulfill it. You will often hit this error when scraping popular websites, many of which have security features to keep bots out. So you need to disguise your request!

  1. For that you need to send headers.
  2. You also need to correct the attribute you are matching on: the tag is identified by itemprop, not by class.
  3. Use lxml as your tree builder, or any other of your choice.

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
    
    # Add this 
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    
    test_html = requests.get(url, headers=headers).text      
    
    soup = BeautifulSoup(test_html, 'lxml')
    
    rating = soup.find('span', {'itemprop':'ratingValue'})
    
    print(rating.text)
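
To see point 2 in isolation, here is a small sketch that needs no network access. The HTML fragment is hypothetical; it only assumes the rating is marked up with an `itemprop` attribute, as the selector above implies. It uses the built-in `html.parser` so that nothing beyond bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: assumes the rating span carries itemprop, not class.
SAMPLE_HTML = """
<div itemprop="aggregateRating">
  <span itemprop="ratingValue">4.58</span>
  <span class="label">rating</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ only matches the class attribute, so this finds nothing.
by_class = soup.find_all("span", class_="ratingValue")

# Matching on the itemprop attribute finds the rating span.
by_itemprop = soup.find("span", {"itemprop": "ratingValue"})

print(by_class)          # []
print(by_itemprop.text)  # 4.58
```

So even once the 403 is out of the way, the original `class_="ratingValue"` lookup would still come back empty on markup like this.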
    

Upvotes: 0

taseikyo

Reputation: 126

If you print the test_html, you'll find you get a 403 forbidden response.

You should add a header (at least a user-agent : ) ) to your GET request.

import requests
from bs4 import BeautifulSoup


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

test_html = requests.get(url, headers=headers).text

soup = BeautifulSoup(test_html, 'html5lib')

rating = soup.find('span', {'itemprop': 'ratingValue'})

print(rating.text)

# 4.58
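
If you would rather have the script fail loudly than print an error page, one option is to check the status before parsing. This is only a sketch: `fetch_rating` and its `get` parameter are hypothetical names added for illustration (the `get` parameter just makes the function easy to try without hitting the site), and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36"
    )
}


def fetch_rating(url, get=requests.get):
    """Return the rating text, or None if the page has no ratingValue span."""
    response = get(url, headers=HEADERS, timeout=10)
    # requests does not raise on 403 by itself; this call turns it into an exception.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find("span", {"itemprop": "ratingValue"})
    return tag.text.strip() if tag is not None else None
```

With this, a blocked request raises `requests.HTTPError` right away instead of showing up later as a missing tag.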

Upvotes: 2

Rajat

Reputation: 118

You are facing this error because some websites can't be scraped with Beautiful Soup alone. For these kinds of websites, you have to use Selenium.

  • Download the latest ChromeDriver from this link for your operating system.
  • Install Selenium with the command "pip install selenium".

# import required modules
import os
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

current_dir = os.getcwd()
print(current_dir)

# Build the path to the web driver in the current directory;
# os.path.join picks the right separator on Windows and elsewhere.
# Make sure you have placed chromedriver in the current directory.
driver_path = os.path.join(current_dir, 'chromedriver')

# Selenium 4 takes the driver path through a Service object
# (older releases accepted it as the first positional argument).
driver = webdriver.Chrome(service=Service(driver_path))

# driver.get opens the URL in the browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)

# fetch the rendered HTML from the driver, then close the browser
super_html = driver.page_source
driver.quit()

# now parse the raw HTML with 'html.parser'
soup = BeautifulSoup(super_html, "html.parser")
rating = soup.find_all("span", itemprop="ratingValue")
rating[0].text

Upvotes: -1
