Reputation: 71
So I have this code that scrapes JavaScript-rendered content:
from requests_html import HTMLSession
#create the session
session = HTMLSession()
#define our URL
url = 'https://partalert.net/product.js?asin=B08L8LG4M3&price=%E2%82%AC702.07&smid=A3JWKAKR8XB7XF&tag=partalertde-21&timestamp=16%3A33+UTC+%2821.4.2021%29&title=ASUS+DUAL+NVIDIA+GeForce+RTX+3070+OC+Edition+Gaming+Grafikkarte+%28PCIe+4.0%2C+8+GB+GDDR6+Speicher%2C+HDMI+2.1%2C+DisplayPort+1.4a%2C+Axial-tech+L%C3%BCfterdesign%2C+Dual+BIOS%2C+Schutzr%C3%BCckwand%2C+GPU+Tweak+II%29&tld=.de'
#use the session to get the data
r = session.get(url)
#Render the page, up the number on scrolldown to page down multiple times on a page
r.html.render(sleep=0, keep_page=True, scrolldown=0)
#take the rendered html and find the element that we are interested in
links = r.html.find('#href')
#loop through those elements extracting the text and link
for item in links:
    link = {
        'link': item.absolute_links
    }
    print(link)
However, it takes 2-3 seconds to load, which is way too long for me. Is there a way to speed it up?
Upvotes: 0
Views: 582
Reputation: 9639
There is no need to scrape the site at all. If you look at the page source, you can see that JavaScript
generates the Amazon URL from the query parameters of the input URL:
document.getElementById(
"href"
).href = `https://www.amazon${tld}/dp/${asin}?tag=${tag}&linkCode=ogi&th=1&psc=1&smid=${smid}`;
This means that you only have to replicate this function in Python
to generate your URLs. You can get the values of the URL parameters with urllib.parse,
then use string formatting to generate the new URL:
from urllib.parse import urlsplit, parse_qs
url = 'https://partalert.net/product.js?asin=B08L8LG4M3&price=%E2%82%AC702.07&smid=A3JWKAKR8XB7XF&tag=partalertde-21&timestamp=16%3A33+UTC+%2821.4.2021%29&title=ASUS+DUAL+NVIDIA+GeForce+RTX+3070+OC+Edition+Gaming+Grafikkarte+%28PCIe+4.0%2C+8+GB+GDDR6+Speicher%2C+HDMI+2.1%2C+DisplayPort+1.4a%2C+Axial-tech+L%C3%BCfterdesign%2C+Dual+BIOS%2C+Schutzr%C3%BCckwand%2C+GPU+Tweak+II%29&tld=.de'
query = urlsplit(url).query
params = parse_qs(query)
amazon_url = f"https://www.amazon{params['tld'][0]}/dp/{params['asin'][0]}?tag={params['tag'][0]}&linkCode=ogi&th=1&psc=1&smid={params['smid'][0]}"
print(amazon_url)
Result:
https://www.amazon.de/dp/B08L8LG4M3?tag=partalertde-21&linkCode=ogi&th=1&psc=1&smid=A3JWKAKR8XB7XF
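If you need to convert many of these links, the same idea could be wrapped in a small helper. This is only a sketch: the function name build_amazon_url is made up for illustration, and it assumes every partalert URL carries the tld, asin, tag and smid parameters that the page's JavaScript reads. It raises a ValueError when one of them is missing instead of failing with a KeyError.

```python
from urllib.parse import urlsplit, parse_qs, urlencode


def build_amazon_url(partalert_url):
    """Rebuild the Amazon link from a partalert.net product URL (hypothetical helper)."""
    params = parse_qs(urlsplit(partalert_url).query)

    # Fail with a clear message if an expected parameter is absent.
    required = ("tld", "asin", "tag", "smid")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing query parameters: {', '.join(missing)}")

    # parse_qs returns a list per key; take the first value of each.
    tld, asin, tag, smid = (params[name][0] for name in required)

    # Same parameter order as the template in the page's JavaScript.
    query = urlencode({"tag": tag, "linkCode": "ogi", "th": 1, "psc": 1, "smid": smid})
    return f"https://www.amazon{tld}/dp/{asin}?{query}"


# Shortened example URL containing only the parameters that are actually used.
example = "https://partalert.net/product.js?asin=B08L8LG4M3&smid=A3JWKAKR8XB7XF&tag=partalertde-21&tld=.de"
print(build_amazon_url(example))
```

Note that urlencode percent-encodes the values, which the original JavaScript template does not do. For plain alphanumeric tags and seller IDs like the ones above, the output is identical. Since no request is made and nothing is rendered, this runs in well under a millisecond.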
Upvotes: 2