Reputation: 189
Previously, I posted a question on how do I get the data from an AJAX website which is from this link: Scraping AJAX e-commerce site using python
I understand a bit on how to get the response which is using the chrome F12 in Network tab and do some coding with python to display the data. But I barely can't find the specific API url for it. The JSON data is not coming from a URL like the previous website, but it is in the Inspect Element in Chrome F12.
My real question actually is how do I get ONLY the JSON data using BeautifulSoup or anything related to it? After I can get only the JSON data from the application/id+json then I will convert it to be a JSON data that python can recognize so that I can display the products into table form.
One more problem is after several time I run the code, the JSON data is missing. I think the website will block my IP address. How to I solve this problem?
Here is the website link:
https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc
Here is my code
from bs4 import BeautifulSoup import requests
page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
print(page_content)
Upvotes: 3
Views: 1812
Reputation: 51
Try:
import requests
response = requests.get(url)
data = response.json()
Upvotes: 1
Reputation: 2445
You can just use the find
method with the pointer to your <script>
tag with the attr type=application/json
Then you can use the json
package to load the value inside a dict
Here is a code sample:
from bs4 import BeautifulSoup as soup
import requests
import json
page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")
json_tag = page_content.find('script',{'type':'application/json'})
json_text = json_tag.get_text()
json_dict = json.loads(json_text)
print(json_dict)
EDIT: My bad, I didn't see you search type=application/ld+json
attr
As it seems to have several <script>
with this attr, you can simply use the find_all
method:
from bs4 import BeautifulSoup as soup
import requests
import json
page_link = 'https://www.lazada.com.my/catalog/?_keyori=ss&from=input&page=1&q=h370m&sort=priceasc'
page_response = requests.get(page_link, timeout=5)
page_content = soup(page_response.text, "html.parser")
json_tags = page_content.find_all('script',{'type':'application/ld+json'})
for jtag in json_tags:
json_text = jtag.get_text()
json_dict = json.loads(json_text)
print(json_dict)
Upvotes: 2
Reputation: 2930
You will have to parse data from HTML manually from your Soup
as other websites will restrict their json API
from other parties.
You can find out more details here in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Upvotes: 1