Reputation: 13
My code works for one site but not for another. Can someone help me out?
import requests
from bs4 import BeautifulSoup
URL = "https://www.homedepot.com/s/311256393"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="root")
print(results.prettify())
Whereas the code below prints output. Is there any difference between the websites?
import requests
from bs4 import BeautifulSoup
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")
print(results.prettify())
Upvotes: 1
Views: 1087
Reputation: 1724
When parsing The Home Depot, you need to use proxies if your IP is outside the US (otherwise the site returns an Access denied error) and fetch the data from their GraphQL API. To find the request URL, open Dev Tools -> Network -> Fetch/XHR, click the appropriate request name, and copy the URL from the Headers tab that opens on the right. Then make a request to that URL.
Then use the JSON response content decoder from the requests library, requests.get("URL").json() (or requests.post, as in the example, if the endpoint expects a POST), which decodes the JSON string into a Python dictionary. Example code would look something like this:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    # add more headers if the response is not 200 (see the "Headers" tab in Dev Tools)
}
response = requests.post('URL', headers=headers).json()
some_variable = response['some_dict_key_from_response']
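If the request is blocked, calling .json() directly on the response tends to fail with an unhelpful decode error. A minimal sketch of a guard follows, assuming the response object exposes status_code, headers and text the way a requests.Response does. The decode_json helper and the stand-in responses are hypothetical and only there so the snippet runs without network access:

```python
import json
from types import SimpleNamespace


def decode_json(response):
    """Return the decoded JSON body, or raise a descriptive error.

    `response` is any object with `status_code`, `headers` and `text`
    attributes (a `requests.Response` has all three).
    """
    if response.status_code != 200:
        raise RuntimeError(f"request failed with HTTP {response.status_code}")
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        raise RuntimeError(f"expected JSON, got {content_type or 'no Content-Type'}")
    return json.loads(response.text)


# Hypothetical stand-ins for real responses (no network access needed).
blocked = SimpleNamespace(
    status_code=403,
    headers={"Content-Type": "text/html"},
    text="Access Denied",
)
ok = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "application/json"},
    text='{"some_dict_key_from_response": 1}',
)

print(decode_json(ok)["some_dict_key_from_response"])
try:
    decode_json(blocked)
except RuntimeError as err:
    print(err)
```

With a real request you would pass the object returned by requests.get or requests.post to decode_json instead of the stand-ins.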
Alternatively, if you don't want to deal with bypassing blocks, you can get the desired output by using The Home Depot Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to deal with the blocks mentioned above or figure out how to scale the number of requests (if needed), and there's no need to maintain the parser over time (if something in the HTML changes). Check out the playground with the product you were looking for.
Example code to integrate (also available as an example in the online IDE):
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "home_depot_product",  # ↓↓↓
    "product_id": "311256393"        # https://www.homedepot.com/s/311256393 ←
                                     # ↑↑↑
}
search = GoogleSearch(params)
results = search.get_dict()
title = results["product_results"]["title"]
link = results["product_results"]["link"]
price = results["product_results"]["price"]
rating = results["product_results"]["rating"]
print(title, link, price, rating, sep="\n")
# actual JSON response is much bigger
'''
20 in. x 20 in. Palace Tile Outdoor Throw Pillow with Fringe
https://www.homedepot.com/p/Hampton-Bay-20-in-x-20-in-Palace-Tile-Outdoor-Throw-Pillow-with-Fringe-7747-04413111/311256393
19.98
5.0
'''
A quick glance at the available product_results keys:
for key in results["product_results"]:
    print(key)
'''
product_id
title
description
link
upc
model_number
favorite
rating
reviews
price
highlights
brand
images
bullets
specifications
fulfillment
'''
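Indexing with results["product_results"]["title"] raises a KeyError if a key is absent. As a rough sketch, assuming some of the keys listed above may be missing for some products (the source does not say whether they are), dict.get can supply a default instead. The pick_fields helper is hypothetical, and the sample dictionary only mirrors the values printed earlier:

```python
# Sample mirroring the output printed earlier; a real response has many more keys.
results = {
    "product_results": {
        "title": "20 in. x 20 in. Palace Tile Outdoor Throw Pillow with Fringe",
        "price": 19.98,
        "rating": 5.0,
    }
}


def pick_fields(product, fields, default=None):
    """Return only the requested fields, using `default` for missing keys."""
    return {field: product.get(field, default) for field in fields}


# Fall back to an empty dict if "product_results" itself is missing.
product = results.get("product_results", {})
summary = pick_fields(product, ["title", "price", "rating", "upc"])
print(summary)
```

Here "upc" is deliberately left out of the sample, so it comes back as None rather than raising an error.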
Disclaimer, I work for SerpApi.
Upvotes: 1