Firdhaus Saleh

Reputation: 189

Scraping JSON data from e-commerce Ajax site with Python

Previously, I posted a question on how to get data from an AJAX website, linked here: Scraping AJAX e-commerce site using python

I understand a bit about how to get the response by using Chrome's F12 Network tab and writing some Python code to display the data. But I can't find the specific API URL for this site. The JSON data does not come from a separate URL like it did on the previous website; instead it is embedded in the page itself, visible under Inspect Element in Chrome (F12).


  1. My real question is: how do I get ONLY the JSON data using BeautifulSoup or anything related to it? Once I can extract just the JSON from the application/ld+json script tag, I will convert it into a Python object so that I can display the products in table form.

  2. One more problem: after I run the code several times, the JSON data goes missing. I think the website is blocking my IP address. How do I solve this?


Here is the website link:

https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc

Here is my code

from bs4 import BeautifulSoup
import requests

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content)

Upvotes: 3

Views: 1812

Answers (3)

coder great

Reputation: 51

Try letting requests decode the JSON for you:

import requests

# url must be an endpoint that actually returns JSON (e.g. an API URL found in the Network tab)
response = requests.get(url)
data = response.json()
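
A minimal usage sketch of this approach, assuming you have located such an endpoint via the DevTools Network tab (the URL and headers below are placeholders, not a real Lazada endpoint):

import requests

# Placeholder endpoint; replace it with the JSON URL you find in the Network tab
url = 'https://example.com/api/search?q=h370m'

# A browser-like User-Agent sometimes helps with simple request blocking (an assumption, not a guarantee)
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers, timeout=5)
response.raise_for_status()   # raise an error for non-2xx responses
data = response.json()        # decode the JSON body into Python objects
print(data)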

Upvotes: 1

Maaz

Reputation: 2445

You can just use the find method to select the <script> tag with the attribute type="application/json".

Then you can use the json package to load its contents into a dict.

Here is a code sample:

from bs4 import BeautifulSoup as soup
import requests
import json

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")

# Select the first <script type="application/json"> tag and decode its contents
json_tag = page_content.find('script', {'type': 'application/json'})
json_text = json_tag.get_text()
json_dict = json.loads(json_text)
print(json_dict)

EDIT: My bad, I didn't see that you are searching for the type="application/ld+json" attribute. Since the page has several <script> tags with this attribute, you can simply use the find_all method:

from bs4 import BeautifulSoup as soup
import requests
import json

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")

# Collect every <script type="application/ld+json"> tag and decode each one
json_tags = page_content.find_all('script', {'type': 'application/ld+json'})
for jtag in json_tags:
    json_text = jtag.get_text()
    json_dict = json.loads(json_text)
    print(json_dict)
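
The question also asks how to display the products in table form. Here is a minimal sketch of that last step, assuming the ld+json blocks follow the schema.org ItemList/Product shape with name, url and offers fields (these field names are assumptions; print one decoded block first to confirm the real structure):

from bs4 import BeautifulSoup as soup
import requests
import json

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")

rows = []
for jtag in page_content.find_all('script', {'type': 'application/ld+json'}):
    data = json.loads(jtag.get_text())
    # Only ItemList blocks describe the product listing (an assumption based on schema.org markup)
    if not isinstance(data, dict) or data.get('@type') != 'ItemList':
        continue
    for entry in data.get('itemListElement', []):
        # Entries may be ListItem wrappers around a nested 'item', or products directly (assumption)
        product = entry.get('item', entry)
        offers = product.get('offers', {})
        rows.append({
            'name': product.get('name'),
            'price': offers.get('price') if isinstance(offers, dict) else None,
            'url': product.get('url'),
        })

# Print the collected rows as a simple tab-separated table
for row in rows:
    print(row['name'], row['price'], row['url'], sep='\t')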

Upvotes: 2

Đào Minh Hạt

Reputation: 2930

You will have to parse the data manually from the HTML in your soup, as websites like this restrict third-party access to their JSON API.

You can find out more details here in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
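
For illustration, a minimal sketch of that manual approach (the CSS class names below are hypothetical placeholders, not Lazada's real markup; inspect the live page to find the actual ones, and note that content rendered by JavaScript may not appear in the raw HTML at all):

from bs4 import BeautifulSoup
import requests

page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.text, "html.parser")

# 'div.product-card' and 'span.product-price' are made-up selectors used only for illustration
for card in page_content.select('div.product-card'):
    name_tag = card.select_one('a')
    price_tag = card.select_one('span.product-price')
    name = name_tag.get_text(strip=True) if name_tag else None
    price = price_tag.get_text(strip=True) if price_tag else None
    print(name, price)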

Upvotes: 1
