Reputation: 13
I'm trying to scrape an e-commerce website, but the page is dynamic. The HTML source contains a script that holds the product data as JSON.
My code is:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import json
url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data,'html.parser')
scripts = soup.find_all('script')
jsonObj = None
for script in scripts:
    if 'window.pageData = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('window.pageData = ')[1]
        jsonObj = json.loads(jsonStr)
products = jsonObj['mods']['listItems']
for item in products:
    print(item['productUrl'])
The result is:
PS C:\Users\nate\Documents\Python\LazadaScapper> & "C:/Program Files/Python39/python.exe" c:/Users/nate/Documents/Python/LazadaScapper/LazadaScraper3.py
Traceback (most recent call last):
  File "c:\Users\nate\Documents\Python\LazadaScapper\LazadaScraper3.py", line 21, in <module>
    products = jsonObj['mods']['listItems']
TypeError: 'NoneType' object is not subscriptable
PS C:\Users\nate\Documents\Python\LazadaScapper>
I did some research, and it seems the for loop never finds a match, so jsonObj stays None and products is never populated.
This is based on a thread posted two years ago, whose solution no longer works.
I'm new to Python and still learning, so I hope you can help me.
Upvotes: 1
Views: 645
Reputation: 195543
The issue is that BeautifulSoup doesn't expose the content of a <script> tag through .text; you have to use .contents instead (the type of the element is bs4.element.Script):
from bs4 import BeautifulSoup
import requests
import json

url = "https://www.lazada.com.ph/chuwi-pilipinas/?q=All-Products&langFlag=en&from=wangpu&lang=en&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

jsonObj = None
for script in scripts:
    # Read the script body from .contents; skip empty <script src=...> tags
    if script.contents and "window.pageData = " in script.contents[0]:
        jsonStr = script.contents[0]
        # Keep what follows the assignment and drop the trailing semicolon
        jsonStr = jsonStr.split("window.pageData = ")[1].strip().strip(";")
        jsonObj = json.loads(jsonStr)
        break  # the first matching script is enough

products = jsonObj["mods"]["listItems"]
for item in products:
    print(item["productUrl"])
Prints:
//www.lazada.com.ph/products/chuwi-hi10x-2-in-1-tablet-with-detachable-keyboard-and-stylus-i2197648497-s9878152829.html?mp=1
//www.lazada.com.ph/products/chuwi-herobook-pro-intel-celeron-windows-10-home-i2194930372-s9864035095.html?mp=1
//www.lazada.com.ph/products/chuwi-mijabook-intel-celeron-n3450-3k-display-i2194877054-s9863142699.html?mp=1
//www.lazada.com.ph/products/chuwi-aerobook-pro-intel-core-m3-windows-10-home-i2189380140-s9832528924.html?mp=1
//www.lazada.com.ph/products/chuwi-gemibook-intel-celeron-windows-10-home-i2189593108-s9833799252.html?mp=1
//www.lazada.com.ph/products/chuwi-corebook-pro-intel-core-i3-windows-10-home-i2189120736-s9831912160.html?mp=1
//www.lazada.com.ph/products/chuwi-corebox-pro-intel-core-i3-i2206581951-s9920301744.html?mp=1
//www.lazada.com.ph/products/chuwi-hi-dock-4-ports-usb-charger-i2234845803-s10064267033.html?mp=1
//www.lazada.com.ph/products/chuwi-herobox-mini-pc-intel-celeron-n4100-i2206416268-s9919983007.html?mp=1
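The parsing step can also be pulled out into a small helper so it can be checked without hitting the site. The following is only a sketch: parse_page_data and product_urls are hypothetical names, the sample string is made up for illustration, and prefixing "https:" to the protocol-relative URLs is an assumption about how you want to use them. It uses json.JSONDecoder().raw_decode, which parses a single JSON value and ignores anything after it, instead of stripping the trailing semicolon by hand.

```python
import json

# Marker string taken from the code above
MARKER = "window.pageData = "


def parse_page_data(script_text, marker=MARKER):
    """Return the object assigned after `marker`, or None if the marker is absent."""
    if marker not in script_text:
        return None
    # raw_decode does not skip leading whitespace, so strip it first
    payload = script_text.split(marker, 1)[1].lstrip()
    # Parses one JSON value and ignores whatever follows it (e.g. ";")
    obj, _end = json.JSONDecoder().raw_decode(payload)
    return obj


def product_urls(page_data):
    """Collect product URLs; returns an empty list when the data is missing."""
    items = (page_data or {}).get("mods", {}).get("listItems", [])
    urls = []
    for item in items:
        url = item["productUrl"]
        # Assumption: protocol-relative URLs ("//host/...") should use https
        urls.append("https:" + url if url.startswith("//") else url)
    return urls


if __name__ == "__main__":
    # Made-up script body, shaped like the one the page embeds
    sample = (
        'window.pageData = {"mods": {"listItems": '
        '[{"productUrl": "//example.com/p1.html"}]}};\nvar other = 1;'
    )
    print(product_urls(parse_page_data(sample)))
```

With the code above you would call parse_page_data(script.contents[0]) inside the loop. Because product_urls falls back to an empty list, a page without the expected script yields no output instead of the TypeError from the question.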
Upvotes: 2