Beautiful Soup returns script language instead of HTML

Question

I wrote a Python program to scrape data from a couple of shopping sites. It worked fine on both until recently.

URL1 - https://www.auchan.pt/pt/alimentacao/alimentacao-bebe-e-crianca/papa-e-farinha-lactea/farinha-cerelac-lactea-500g/70511.html

URL2 - https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html

I use the following simple code:

import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url is URL1 or URL2 from above

soup = BeautifulSoup(response.content, "html.parser")

# ... and then I have my code to parse stuff...

The problem: on URL1 everything works, and if I print(soup) I get the same HTML I see when viewing the page source in a browser. On URL2, however, I get what seems to be script code (please see the attached image), and my parsing code then fails because it can't find the elements it expects. If I open the page in a browser, it renders correctly and I can view the source as expected.

image

I am obviously a newbie, but this looks like some kind of protection against scraping. Is there anything I can do?

Thanks!

kggn · Accepted Answer

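The "script language" you're seeing is minified JavaScript. I assume it makes a request to a central server at Continente and then populates the page with the result. Since requests only downloads the raw response and never executes JavaScript, the product markup is never created. The easiest way to get the populated page is to drive a real browser through Selenium's ChromeDriver, which executes the script and fills in the page for you, behaving almost exactly like a normal browser session.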

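from selenium import webdriver
from bs4 import BeautifulSoup

# Start a real Chrome instance so the page's JavaScript gets executed
driver = webdriver.Chrome()

driver.get("https://www.continente.pt/produto/papa-infantil-farinha-lactea-6m-cerelac-2004388.html")

# page_source is the page as the browser currently sees it, not the raw response
soup = BeautifulSoup(driver.page_source, "html.parser")

driver.quit()  # close the browser once the HTML has been captured

# ...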
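
Before switching a scraper to a browser driver, it can help to confirm that the response really is mostly script. The following is a minimal sketch using only the standard library; the helper name looks_js_rendered and the 0.8 threshold are my own assumptions for illustration, not part of requests, Beautiful Soup, or Selenium.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Count characters inside <script> separately from ordinary page text."""

    def __init__(self):
        super().__init__()
        self._raw_tag = None      # "script" or "style" while inside one
        self.script_chars = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._raw_tag = tag

    def handle_endtag(self, tag):
        if tag == self._raw_tag:
            self._raw_tag = None

    def handle_data(self, data):
        n = len(data.strip())
        if self._raw_tag == "script":
            self.script_chars += n
        elif self._raw_tag is None:   # CSS inside <style> is ignored
            self.visible_chars += n


def looks_js_rendered(html, threshold=0.8):
    """Return True if at least `threshold` of the text is script code.

    Hypothetical heuristic: the threshold is an arbitrary starting point.
    """
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.visible_chars
    if total == 0:
        return False
    return parser.script_chars / total >= threshold
```

With the question's code, this could be called as looks_js_rendered(response.text). A True result suggests the content is filled in by JavaScript and that a browser-driven approach like the one above is worth trying; it is only a rough signal, so the threshold may need adjusting per site.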
