Reputation: 3
I am learning BeautifulSoup and tried to extract all the "a" tags from a website. I get a lot of "a" tags, but a few of them are missing and I don't understand why. Any help would be highly appreciated.
The link I used is: https://www.w3schools.com/python/
Image: https://ibb.co/mmEKTK
The red box in the image marks a section that bs4 ignores completely, even though it contains "a" tags.
Code:
import requests
import bs4

# The html5lib package must be installed for the 'html5lib' parser,
# but it does not need to be imported here.
res = requests.get('https://www.w3schools.com/python/')
soup = bs4.BeautifulSoup(res.text, 'html5lib')

# Print the href of every <a> tag that has one.
links = soup.find_all('a', href=True)
for a in links:
    print(a['href'])
if not links:
    print('none')
Sorry about the code formatting; I am new here.
Upvotes: 0
Views: 116
Reputation: 2821
The links that bs4 misses are rendered dynamically: advertisements and similar content are not present in the HTML source but are injected by scripts, for example based on your browsing habits. The requests package only fetches the static HTML, so you need to simulate a browser to get the dynamic content.
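To illustrate why a parser never sees such links, here is a minimal sketch. The sample page and its paths are made up, and the standard library's HTMLParser is swapped in for BeautifulSoup only so that the snippet runs without extra packages; the same reasoning applies to bs4.

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the markup, the other is only
# created when a browser executes the script.
SAMPLE_HTML = """
<html><body>
<a href="/static-link">Static</a>
<script>
var a = document.createElement("a");
a.href = "/injected-by-script";
document.body.appendChild(a);
</script>
</body></html>
"""


class HrefCollector(HTMLParser):
    """Collects the href of every <a> start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# The script body is treated as plain text and never executed,
# so only the link written in the markup is found.
print(extract_hrefs(SAMPLE_HTML))  # ['/static-link']
```

The second link exists only after a JavaScript engine has run the script, which neither requests nor an HTML parser does.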
Selenium can drive a browser such as Chrome or Firefox. If you want the same results on a server (without a UI), use a headless browser such as PhantomJS.
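A rough sketch of that approach, assuming Selenium 4 with Chrome and a matching driver available; headless Chrome is used here in place of PhantomJS, and the helper names make_headless_chrome and rendered_hrefs are made up for illustration. It parses with bs4's built-in 'html.parser', though 'html5lib' from the question should work the same way.

```python
import bs4


def make_headless_chrome():
    """Start Chrome without a window (assumes Selenium 4 and Chrome are installed)."""
    from selenium import webdriver  # imported lazily; only needed for a real browser

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_hrefs(driver, url):
    """Load url in the browser and return the href of every <a> tag."""
    driver.get(url)
    # page_source is read after the browser has loaded the page, so links
    # added by scripts that have already run should be included.
    soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Usage (starts a real browser, so it is left commented out):
# driver = make_headless_chrome()
# try:
#     for href in rendered_hrefs(driver, "https://www.w3schools.com/python/"):
#         print(href)
# finally:
#     driver.quit()
```

Advertisements often load asynchronously, so some links may appear only after a short delay; an explicit wait before reading page_source may be needed in practice.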
Upvotes: 1