Python BeautifulSoup web scraping trouble finding correct class in HTML

Question

My code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl = 'https://www.rebuy.de/kaufen/videospiele-nintendo-switch?page=1'

#opening up connection, grabbing the page
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grabs each product
containers = page_soup.find_all("div", class_="ry-product__item ry-product__item--large")

I want to extract the item containers that hold the image, title and price from this website. When I run this code, it returns an empty list:

[]

I am sure the code works because when I pass, for example, class_="row", it returns the tags that have this class.

I want to extract all the containers that have the class used in the code above, but it seems that I am either choosing the wrong class or running into a problem because there are multiple classes on this tag. What am I doing wrong?

ggorlen · Accepted Answer

The issue is that these DOM elements are loaded dynamically via AJAX. If you view the page source of this site, you won't find any of these classes because the elements haven't been created yet when the server sends the HTML. One solution is to make the same request that the page does and extract the data from the response.
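To see why the multiple classes are not the culprit, here is a minimal offline sketch. The two HTML strings are made-up stand-ins (not the real rebuy.de markup) for what the server sends and what the browser shows after scripts run, and find_items is a hypothetical helper. Passing a multi-class string to class_ only matches that exact attribute string, so the sketch uses a CSS selector, which matches both classes in any order. Either way, nothing can match if the element is absent from the static HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins: the HTML as served vs. the DOM after scripts have run.
static_html = '<div class="row"><product></product></div>'
rendered_html = (
    '<div class="row"><product>'
    '<div class="ry-product__item ry-product__item--large">Mario Kart 8 Deluxe</div>'
    '</product></div>'
)

def find_items(html):
    """Return all divs that carry both product classes (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # The CSS selector requires both classes but does not care about their order.
    return soup.select("div.ry-product__item.ry-product__item--large")

static_items = find_items(static_html)      # nothing to find in the served HTML
rendered_items = find_items(rendered_html)  # found once the element exists

print(len(static_items), len(rendered_items))
```

If the same selector returns nothing on the HTML fetched with urlopen but matches in the browser's inspector, the content is most likely being added by JavaScript.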

Another approach is to use a tool like Selenium to load these elements and interact with them dynamically.

Here's some code to retrieve and print the fields you're interested in. Hopefully this will get you started. This requires installing Chromedriver.

Note that I took the liberty of parsing the results with regex a bit, but that's not critical.

import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# run Chrome without opening a window
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.rebuy.de/kaufen/videospiele-nintendo-switch")

# Selenium 4 locator API; the older find_element(s)_by_* helpers were removed
for product in driver.find_elements(By.TAG_NAME, "product"):
    name_elem = product.find_element(By.CLASS_NAME, "ry-product-item-content__name")
    print("name:\t", name_elem.get_attribute("innerHTML"))

    # the image is a CSS background, so strip the surrounding url(...)
    image_elem = product.find_element(By.CLASS_NAME, "ry-product-item__image")
    image = str(image_elem.value_of_css_property("background-image"))
    print("image:\t", re.search(r"^url\((.*)\)$", image).group(1))

    # keep only the numeric part of the price, e.g. 43,99
    price_elem = product.find_element(By.CLASS_NAME, "ry-price__amount")
    price = price_elem.get_attribute("innerHTML")
    print("price:\t", re.search(r"\d?\d,\d\d", price).group(0), "\n")

driver.quit()

Output (60 results):

name:    Mario Kart 8 Deluxe
image:   "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/253/covers/205.jpeg?time=0"
price:   43,99

name:    Super Mario Odyssey
image:   "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/263/covers/205.jpeg?time=1508916366"
price:   40,69

...

name:    South Park: Die Rektakuläre Zerreißprobe
image:   "https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0"
price:   35,99

name:    Cars 3: Driven To Win [Internationale Version]
image:   "https://d2wr8zbg9aclns.cloudfront.net/products/010/967/629/covers/205.jpeg?time=1528267000"
price:   30,99
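The regex steps from the loop can also be tried without a browser. The sketch below is only an illustration: css_url and parse_price are hypothetical helper names, and the price text "43,99 €" is an assumed example of what the price element might contain. Unlike the loop above, css_url also strips the quotes around the address, and parse_price converts the comma decimal into a Decimal.

```python
import re
from decimal import Decimal

def css_url(value):
    """Return the address inside a CSS url(...) value, without surrounding quotes."""
    match = re.search(r'^url\(["\']?(.*?)["\']?\)$', value)
    return match.group(1) if match else None

def parse_price(text):
    """Find a price such as '43,99' in text and convert it to a Decimal."""
    match = re.search(r"\d+,\d\d", text)
    return Decimal(match.group(0).replace(",", ".")) if match else None

# Sample inputs: the URL is taken from the output above, the price text is assumed.
image = css_url('url("https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0")')
price = parse_price("43,99 €")

print(image)
print(price)
```

Returning None when nothing matches avoids the AttributeError that calling .group() on a failed re.search would raise, which helps if some products lack an image or price.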
