Gayan Jeewantha

Reputation: 335

Scraping issue: dynamic content (without Selenium)

I need to scrape http://www.vintagetoday.be/fr/montres but it has dynamic content.

How can I do this?

My code:

import requests
from bs4 import BeautifulSoup

t = requests.get("http://www.vintagetoday.be/fr/catalogue.awp").text
print(len(BeautifulSoup(t, "lxml").findAll("td", {"class": "Lien2"})))

The result is 16, but there are 430 articles.

Upvotes: 0

Views: 1035

Answers (2)

Ayoub_B

Reputation: 700

It's normal that you're getting just 16 links instead of 430: when the page first loads, it only contains the first 16 watches (links). To get the rest, you have to scroll down the page so that more watches appear. You can achieve this with Selenium.
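A minimal sketch of that Selenium approach, assuming Chrome and that scrolling to the bottom is what triggers the lazy load; the td.Lien2 selector comes from the question, and the number of scroll iterations is a guess you would tune:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("http://www.vintagetoday.be/fr/montres")

# Scroll to the bottom repeatedly so the lazy loader fetches more watches.
# 30 iterations is an assumption; increase it until all 430 articles appear.
for _ in range(30):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the AJAX request time to finish

# Parse the fully rendered page, not the initial HTML.
soup = BeautifulSoup(driver.page_source, "lxml")
print(len(soup.find_all("td", {"class": "Lien2"})))
driver.quit()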

A better method would be to reverse-engineer the AJAX call they use to load (paginate) the watches and make that call directly in your code. A quick look shows that they POST to the following URL to load more watches:

http://www.vintagetoday.be/fr/montres?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB

I don't see any parameter that indicates the pagination, though, which means it's stored in the session. They also send some query-string parameters in the request's body, so you need to check that as well.

The return value seems to be XML, which makes it straightforward to extract the URLs.
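A rough sketch of that approach, assuming the session cookie from the first page load is what the server uses to track pagination state; the AWPID token is the one observed above but is likely session-specific, and the empty payload is a placeholder, so copy both from your browser's network tab:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Load the listing page first so the server creates the session
# that (per the above) tracks the pagination state.
session.get("http://www.vintagetoday.be/fr/montres")

# Assumption: this token expires per session; grab a fresh one from
# the browser's network tab along with the real request body.
ajax_url = ("http://www.vintagetoday.be/fr/montres"
            "?AWPIDD9BBA1F0=27045E7B002DF1FE7C1BA8D48193FD1E54B2AAEB")
payload = {}  # placeholder: fill in the form fields the browser sends

resp = session.post(ajax_url, data=payload)

# The response appears to be XML, so parse it as such and pull the hrefs.
soup = BeautifulSoup(resp.text, "xml")
print([a.get("href") for a in soup.find_all("a")])

Repeating the POST with the same session should then return successive pages of watches, if the pagination really is tracked server-side.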

Upvotes: 0

ASH

Reputation: 20322

I'm definitely NOT an expert with this stuff, but I think this is what you want.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Fetch the page and parse it with lxml.
req = Request("http://www.vintagetoday.be/fr/montres")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")

# Collect the href of every anchor tag on the page.
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)

See the two links below for more info.

https://pythonspot.com/extract-links-from-webpage-beautifulsoup/

https://pythonprogramminglanguage.com/get-links-from-webpage/

Upvotes: 0
