Reputation: 225
I'm scraping data from e-commerce site and I need model number of each laptops. But in div tags, there are no model numbers. I found model number inside script tag as "productCode". For this example its:
"productCode":"MGND3TU/A"
How can I gather the "productCode" data. I couldn't understand from other posts.
Edit: I find the ‘productCode’ inside script tag. But i don’t know how to get it. You can check from page source.
Upvotes: 1
Views: 620
Reputation: 25231
In this case beautifulsoup
is not needed cause response could be searched directly with regex
:
json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
import requests, re, json
r = requests.get('https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132').text
json_data = json.loads(re.search(r"window.__PRODUCT_DETAIL_APP_INITIAL_STATE__=({.*}});window", r).group(1))
json_data['product']['productCode']
MGND3TU/A
Upvotes: 1
Reputation: 44274
Since the JSON is hidden in the <head>
, it can be parsed, but with some custom logic.
Unfortunately the script tags exports the JSON to a window var, so we'll need to remove that befor we can parse it.
<script>
PRODUCT_DETAIL_APP_INITIAL_STAT
exist in the string (valid json)Try
to parse to jsonjson['product']['productCode']
if it existsimport json
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
reqs = requests.get("https://www.trendyol.com/apple/macbook-air-13-m1-8gb-256gb-ssd-altin-p-67940132")
soup = BeautifulSoup(reqs.text, 'html.parser')
for sc in soup.findAll('script'):
if len(sc.contents) > 0 and "PRODUCT_DETAIL_APP_INITIAL_STAT" in sc.contents[0]:
withoutBegin = sc.contents[0][44:]
endIndex = withoutBegin.find('window.TYPageName=') - 1
withoutEnd = withoutBegin[:endIndex]
try:
j = json.loads(withoutEnd)
if j['product']['productCode']:
print(j['product']['productCode'])
except Exception as e:
print("Unable to parse JSON")
continue
Output:
MGND3TU/A
Upvotes: 2
Reputation: 16564
That's because those tags are generated using JavaScript. When you send a request to that URL, you will get back a response which has information for a JS script to build DOM for you. (technically JSON information):
To see what your returned response actually is, either print the value of r.text
(r
is returned from requests.get()
) or manually see the "view page source" from the browser. (not inspect element section)
Now to solve it, you can either use something that can render JS, just like your browser. For example Selenium. requests
module is not capable of rendering JS. It is just for sending and receiving requests.
Or manually extract that JSON text from the returned text (using Regex or,...) then create a Python dictionary from it.
Upvotes: 1