Fabio Salinas

Reputation: 63

How to scrape with BeautifulSoup after waiting for the page's elements to finish loading

I'm trying to scrape data from this website (exito.com), which shows three kinds of prices for some products (muted price, red price and black price). I noticed that, when a product has three prices, the red price changes while the page is still loading.

When I scrape the website I get only two prices. I think that if the code waited until the page had fully loaded, I would get all three.

Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url='https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")

# Muted Price
MutedPrice = soup.find_all("span",{'class':'exito-vtex-components-2-x-listPriceValue ph2 dib strike custom-list-price fw5 exito-vtex-component-precio-tachado'})[0].text
MutedPrice=pd.to_numeric(MutedPrice[2-len(MutedPrice):].replace('.',''))

# Red Price
RedPrice = soup.find_all("span",{'class':'exito-vtex-components-2-x-sellingPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-rojo'})[0].text
RedPrice=pd.to_numeric(RedPrice[2-len(RedPrice):].replace('.',''))

# black Price
BlackPrice = soup.find_all("span",{'class':'exito-vtex-components-2-x-alliedPrice fw1 f3 custom-selling-price dib ph2 exito-vtex-component-precio-negro'})[0].text
BlackPrice=pd.to_numeric(BlackPrice[2-len(BlackPrice):].replace('.',''))

print('Muted Price:',MutedPrice)
print('Red Price:',RedPrice)
print('Black Price:',BlackPrice)

Actual Results:
Muted Price: 3199900
Red Price: 1649868
Black Price: 0

Expected Results:
Muted Price: 3199900
Red Price: 1550032
Black Price: 1649868

Upvotes: 4

Views: 976

Answers (2)

Pierre

Reputation: 1099

The page you are trying to scrape contains JavaScript code, which is executed by your browser and modifies the page after it is downloaded. If you want to perform extractions on the "final state" of the page, you need to run the JavaScript code on the page using a library dedicated to that. Unfortunately, BeautifulSoup does not have this functionality, and you will need to use another library to achieve your task.

For example, you can pip install requests-html and run the following:

#!/usr/bin/env python3

import re
from requests_html import HTMLSession

def parse_price_text(price_text):
    """Extract just the price digits and dots from the <span> tag text"""
    matches = re.search(r"([\d.]+)", price_text)
    if not matches:
        raise RuntimeError(f"Could not parse price text: {price_text}")

    return matches.group(1)

# Starting a session and running the JavaScript code with render()
# to make sure the DOM is the same as when using the browser.
session = HTMLSession()
exito_url = "https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p"
response = session.get(exito_url)
response.html.render()

# Define all price types and their associated CSS class
price_types = {
    "listPrice": "exito-vtex-components-2-x-listPriceValue",
    "sellingPrice": "exito-vtex-components-2-x-sellingPrice",
    "alliedPrice": "exito-vtex-components-2-x-alliedPrice"
}

# Iterate over price types and extract them from the page
for price_type, price_css_class in price_types.items():
    price = parse_price_text(response.html.find(f"span.{price_css_class}", first=True).text)
    print(f"{price_type} price: {price} $")

It prints the following:

listPrice price: 3.199.900 $
sellingPrice price: 1.550.032 $
alliedPrice price: 1.649.868 $
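
If you need plain integers like the ones in the question rather than dotted strings, the conversion could be sketched as below. The helper name price_to_int is made up for this example, and it assumes the dots are thousands separators and the prices have no decimal part.

```python
def price_to_int(price_text: str) -> int:
    """Convert a dotted price such as '1.550.032' to the integer 1550032.

    Assumption: every dot is a thousands separator, so keeping only
    the digits gives the full amount.
    """
    digits = "".join(ch for ch in price_text if ch.isdigit())
    if not digits:
        raise ValueError(f"No digits found in price text: {price_text!r}")
    return int(digits)


# The three strings printed above
for text in ("3.199.900", "1.550.032", "1.649.868"):
    print(price_to_int(text))
```

This prints 3199900, 1550032 and 1649868, which match the expected results in the question.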

Upvotes: 0

Rithin Chalumuri

Reputation: 1839

Those values are probably rendered dynamically, i.e. populated by JavaScript after the page is downloaded.

requests.get() simply returns the markup received from the server, without any client-side changes, so the problem is not really about waiting.
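
To illustrate the point, here is a small sketch using only the standard library. The HTML string is invented for the example (it is not the real exito.com markup): the span is empty in the server response and would only be filled in by the script if something executed it. A plain parser, like the one behind BeautifulSoup, never runs that script.

```python
from html.parser import HTMLParser

# Hypothetical markup: the span is empty in the server response and is
# only filled in when a browser executes the script.
STATIC_HTML = """
<html><body>
  <span class="allied-price"></span>
  <script>
    document.querySelector(".allied-price").textContent = "$ 1.649.868";
  </script>
</body></html>
"""


class SpanTextCollector(HTMLParser):
    """Collect the text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.span_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Script source arrives here too, but it is outside any span.
        if self.in_span:
            self.span_text.append(data)


collector = SpanTextCollector()
collector.feed(STATIC_HTML)
span_text = "".join(collector.span_text).strip()
print(repr(span_text))  # the span has no text because the script never ran
```

However long you wait after requests.get(), the span stays empty, because nothing ever executes the script.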

You could use Selenium's Chrome WebDriver to load the page URL and get the page source (the Firefox driver works too).

Go to chrome://settings/help to check your current Chrome version, then download the ChromeDriver release that matches it. Keep the driver file either on your PATH or in the same folder as your Python script.

Try replacing the three lines of your code that fetch and parse the page (url, req and soup) with this:

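from bs4 import BeautifulSoup
from selenium.webdriver import Chrome # pip install selenium
from selenium.webdriver.chrome.service import Service

url='https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p'

# Use Chrome to get the page with JavaScript-generated content.
# Selenium 4 takes the driver path through a Service object; recent
# releases no longer accept the old executable_path argument.
# Leaving the with block quits the browser.
with Chrome(service=Service("./chromedriver")) as browser:
    browser.get(url)
    page_source = browser.page_source

soup = BeautifulSoup(page_source, "lxml")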

Outputs:

Muted Price: 3199900
Red Price: 1550032
Black Price: 1649868
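
Since the question is about waiting, here is a rough sketch of an explicit wait with Selenium's WebDriverWait, in case the page source is read before the scripts have finished. The function name fetch_rendered_html and the 15 second default timeout are my own choices for this example, and it assumes Selenium can find a matching chromedriver on its own (for example on your PATH).

```python
def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the HTML once css_selector is present."""
    # Imported here so that defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Assumption: a chromedriver matching the installed Chrome can be located.
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Block until the element exists in the DOM; a TimeoutException
        # is raised if it does not appear within `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return browser.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        browser.quit()


# Example call (needs Chrome, chromedriver and network access):
# html = fetch_rendered_html(
#     "https://www.exito.com/televisor-led-samsung-55-pulgadas-uhd-4k-smart-tv-serie-7-24449/p",
#     "span.exito-vtex-components-2-x-alliedPrice",
# )
# soup = BeautifulSoup(html, "lxml")
```

Waiting for the black price span is only a heuristic based on the question, where that element is the one missing from the raw response. It does not guarantee that every other price has already been updated.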

References:

Get page generated with Javascript in Python

selenium - chromedriver executable needs to be in PATH

Upvotes: 2
