Reputation: 1159
My code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = 'https://www.rebuy.de/kaufen/videospiele-nintendo-switch?page=1'
#opening up connection, grabbing the page
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each product
containers = page_soup.find_all("div", class_="ry-product__item ry-product__item--large")
I want to extract the item containers that hold the image, title and price from this website. When I run this code, it returns an empty list:
[]
I am sure the code itself works, because when I use, for example, class_="row", it returns the tags that have this class.
I want to extract all the containers that have this class, but it seems that either I am choosing the wrong class or the match fails because this <div> tag has multiple classes. What am I doing wrong?
Upvotes: 1
Views: 724
Reputation: 195573
The site loads the products dynamically through Ajax. The Chrome/Firefox network inspector reveals the address of the API that the site loads the product data from (https://www.rebuy.de/api/search?page=1&categorySanitizedPath=videospiele-nintendo-switch):
import requests
import json
from pprint import pprint
headers = {}
# headers = {"Host":"www.rebuy.de",
# "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Cookie":"SET THIS TO PREVENT ACCESS DENIED",
# "Accept-Encoding":"gzip,deflate,br",
# "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
url = "https://www.rebuy.de/api/search?page={}&categorySanitizedPath=videospiele-nintendo-switch"
page = 1
r = requests.get(url.format(page), headers=headers)
data = json.loads(r.text)
pprint(data['products'])
# print(json.dumps(data, indent=4, sort_keys=True))
Prints:
{'docs': [{'avg_rating': 5,
'badges': [],
'blue_price': 1999,
'category_id': {'0': 94, '1': 3098},
'category_is_accessory': False,
'category_name': 'Nintendo Switch',
'category_sanitized_name': 'nintendo-switch',
'cover_updated_at': 0,
'has_cover': True,
'has_percent_category': False,
'has_variant_in_stock': True,
'id': 10725297,
'name': 'FIFA 18',
'num_ratings': 1,
'price_min': 1999,
'price_recommended': 0,
'product_sanitized_name': 'fifa-18',
'root_category_name': 'Videospiele',
'variants': [{'label': 'A1',
'price': 2199,
'purchasePrice': 1456,
'quantity': 2},
{'label': 'A2',
'price': 1999,
'purchasePrice': 1919,
'quantity': 7},
{'label': 'A3',
'price': 1809,
'purchasePrice': 1919,
'quantity': 0},
{'label': 'A4',
'price': 1409,
'purchasePrice': 1919,
'quantity': 0}]},
...and so on.
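To pull just a few fields out of the `docs` list printed above, something like the following could work. It is a minimal sketch: `summarize_docs` is a hypothetical helper name, and it assumes the prices are integer cents (so 1999 means 19.99 EUR), which the response itself does not state. The printed data contains no image URL field, so the sketch does not try to build one.

```python
def summarize_docs(products):
    """Return (name, price_in_euros, in_stock) tuples from the 'products' dict."""
    rows = []
    for doc in products.get("docs", []):
        # Assumption: prices are integer cents (1999 -> 19.99 EUR).
        price = doc["price_min"] / 100
        rows.append((doc["name"], price, doc["has_variant_in_stock"]))
    return rows


# Trimmed sample copied from the output above.
sample = {"docs": [{"name": "FIFA 18",
                    "price_min": 1999,
                    "has_variant_in_stock": True}]}

for name, price, in_stock in summarize_docs(sample):
    print(f"{name}: {price:.2f} EUR (in stock: {in_stock})")
# prints: FIFA 18: 19.99 EUR (in stock: True)
```

With the request code above, the call would be `summarize_docs(data['products'])`.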
One caveat: when many requests are made, the site returns Access Denied. To prevent this, set the headers with the Cookie from your browser (to get the cookie, look inside the Chrome/Firefox network inspector).
A better solution would be to use Selenium.
Upvotes: 3
Reputation: 57259
The issue is that these DOM elements were loaded dynamically via AJAX. If you view the source code of this site, you won't be able to find any of these classes because they haven't been created yet. One solution is to make the same request that the page does and extract the data from the response as shown here.
Another approach is to use a tool like Selenium to load these elements and interact with them dynamically.
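A quick way to confirm that a class is missing from the raw response, rather than being matched incorrectly, is to list every class name that appears in the static HTML. The sketch below uses only the standard library; `ClassCollector` and `classes_in` are hypothetical names, and `static_html` is a made-up stand-in for what the server sends before any JavaScript runs.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


# Hypothetical stand-in for the server response (no product markup yet).
static_html = '<div class="row"><div id="app"></div></div>'
found = classes_in(static_html)
print("row" in found)               # True
print("ry-product__item" in found)  # False
```

For the page in the question, you would pass the decoded `page_html` instead of `static_html`. If the product classes are absent from the result, no selector will find them in that response.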
Here's some code to retrieve and print the fields you're interested in; hopefully it will get you started. It requires installing ChromeDriver.
Note that I took the liberty to parse the results with regex a bit, but that's not critical.
import re
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--headless")
# Selenium 4 takes options=...; older releases used chrome_options=...
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.rebuy.de/kaufen/videospiele-nintendo-switch")
# Selenium 4 locator API; older releases used find_elements_by_tag_name() etc.
for product in driver.find_elements(By.TAG_NAME, "product"):
    name_elem = product.find_element(By.CLASS_NAME, "ry-product-item-content__name")
    print("name:\t", name_elem.get_attribute("innerHTML"))
    image_elem = product.find_element(By.CLASS_NAME, "ry-product-item__image")
    # The cover is set as a CSS background image, e.g. url("https://...")
    image = str(image_elem.value_of_css_property("background-image"))
    print("image:\t", re.search(r"^url\((.*)\)$", image).group(1))
    price_elem = product.find_element(By.CLASS_NAME, "ry-price__amount")
    price = price_elem.get_attribute("innerHTML")
    print("price:\t", re.search(r"\d?\d,\d\d", price).group(0), "\n")
driver.quit()  # close the headless browser when done
Output (60 results):
name: Mario Kart 8 Deluxe
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/253/covers/205.jpeg?time=0"
price: 43,99
name: Super Mario Odyssey
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/574/263/covers/205.jpeg?time=1508916366"
price: 40,69
...
name: South Park: Die Rektakuläre Zerreißprobe
image: "https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0"
price: 35,99
name: Cars 3: Driven To Win [Internationale Version]
image: "https://d2wr8zbg9aclns.cloudfront.net/products/010/967/629/covers/205.jpeg?time=1528267000"
price: 30,99
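The image values above still carry their surrounding quotes, and the prices are strings with a decimal comma. If you need a bare URL and a number instead, two small helpers along these lines might do. This is a sketch: `css_url` and `price_to_float` are hypothetical names, and the price helper assumes there is no thousands separator (a value like 1.299,00 would not be handled).

```python
import re


def css_url(value):
    """Extract the bare URL from a CSS value such as url("https://...")."""
    match = re.fullmatch(r'url\(\s*["\']?(.*?)["\']?\s*\)', value.strip())
    return match.group(1) if match else None


def price_to_float(text):
    """Convert a price with a decimal comma, such as '43,99', to a float.

    Assumption: no thousands separator is present.
    """
    match = re.search(r"\d+,\d\d", text)
    return float(match.group(0).replace(",", ".")) if match else None


print(css_url('url("https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0")'))
# prints: https://d2wr8zbg9aclns.cloudfront.net/products/default/205.jpeg?time=0
print(price_to_float("43,99"))
# prints: 43.99
```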
Upvotes: 2