Reputation: 335
I need to scrape http://www.vintagetoday.be/fr/montres but it has dynamic content.
How can I do this?
import requests
from bs4 import BeautifulSoup

# requests needs an explicit scheme in the URL
t = requests.get("http://www.vintagetoday.be/fr/catalogue.awp").text
print(len(BeautifulSoup(t, "lxml").find_all("td", {"class": "Lien2"})))
Upvotes: 0
Views: 1035
Reputation: 700
Getting only 16 links instead of 430 is expected. When the page first loads, it contains only the first 16 watches (links); more watches appear as you scroll down the page. To reproduce that behaviour, you can drive a real browser with Selenium.
A better approach is to reverse-engineer the AJAX call they use to load more watches (the pagination) and issue that call directly from your code. A quick look shows that they send a POST request to the following URL to load more watches:
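A minimal sketch of the Selenium route, under a few assumptions: Firefox with geckodriver is installed, the watches are still rendered in td elements with class Lien2 (taken from the question), and a two-second pause is enough for each batch to load. The helper names scroll_until_stable and scrape_with_selenium are made up for illustration.

```python
import time


def scroll_until_stable(count_items, scroll_down, pause=2.0, max_rounds=50):
    """Scroll until the item count stops growing; return the final count."""
    seen = count_items()
    for _ in range(max_rounds):
        scroll_down()
        time.sleep(pause)  # give the page time to load the next batch
        now = count_items()
        if now == seen:  # nothing new appeared, assume we reached the end
            break
        seen = now
    return seen


def scrape_with_selenium(url="http://www.vintagetoday.be/fr/montres"):
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # assumption: Firefox + geckodriver available
    try:
        driver.get(url)
        scroll_until_stable(
            # "td.Lien2" is the selector from the question, not verified here
            lambda: len(driver.find_elements(By.CSS_SELECTOR, "td.Lien2")),
            lambda: driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            ),
        )
        cells = driver.find_elements(By.CSS_SELECTOR, "td.Lien2")
        return [cell.text for cell in cells]
    finally:
        driver.quit()
```

Calling scrape_with_selenium() should return the text of every loaded cell. The stop condition is deliberately simple: if a batch takes longer than the pause to arrive, the loop ends early, so a longer pause or an explicit wait may be needed.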
http://www.vintagetoday.be/fr/montres?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB
I don't see any parameter that indicates the page number, though, which suggests the pagination state is stored in the session. They also send some query-string-style parameters in the request body, so you need to check those as well.
The response seems to be XML, so extracting the URLs from it should be straightforward.
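Replaying that call could look roughly like the sketch below, which uses only the standard library. Several things here are assumptions rather than facts about the site: the AWPID token in the URL may change between sessions, the POST body has to be copied from the browser's network tab (body is a placeholder argument), each repeated POST is assumed to return the next batch because the position lives in the session, and the response is assumed to contain plain, unescaped anchor tags. The names HrefCollector, extract_hrefs and fetch_batches are hypothetical.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar

PAGE_URL = "http://www.vintagetoday.be/fr/montres"
# Token copied from the answer; it may well differ in your own session.
AJAX_URL = PAGE_URL + "?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB"


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_hrefs(payload):
    """Return all anchor hrefs found in a response body (str)."""
    parser = HrefCollector()
    parser.feed(payload)
    parser.close()
    return parser.hrefs


def fetch_batches(body, max_batches=30):
    """POST repeatedly within one cookie session until no new links arrive.

    body: bytes copied from the real request in the browser's network tab.
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.open(PAGE_URL).read()  # initial GET establishes the session cookie
    links = []
    for _ in range(max_batches):
        with opener.open(AJAX_URL, data=body) as resp:  # data= makes it a POST
            payload = resp.read().decode("utf-8", errors="replace")
        new = [h for h in extract_hrefs(payload) if h not in links]
        if not new:  # an empty or repeated batch is treated as the end
            break
        links.extend(new)
    return links
```

If the XML wraps the markup in CDATA or escapes it, extract_hrefs will find nothing, and the payload would need to be unwrapped first (for example with xml.etree.ElementTree) before looking for links.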
Upvotes: 0
Reputation: 20322
I'm definitely NOT an expert with this stuff, but I think this is what you want.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Fetch the page and parse the HTML
req = Request("http://www.vintagetoday.be/fr/montres")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

# Collect the href of every <a> tag on the page
links = []
for link in soup.find_all('a'):
    links.append(link.get('href'))
print(links)
See the two links below for more info.
https://pythonspot.com/extract-links-from-webpage-beautifulsoup/
https://pythonprogramminglanguage.com/get-links-from-webpage/
Upvotes: 0