Reputation: 31
I am trying to use BeautifulSoup to get the text of the ul under a div on this website: https://www.nccn.org/professionals/physician_gls/recently_updated.aspx
But I only get an empty div. My code was:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.nccn.org/professionals/physician_gls/recently_updated.aspx")
soup = BeautifulSoup(page.content, "html.parser")
_div = soup.find("div", {"id": "divRecentlyUpdatedList"})
# collect the text of every link in every ul under the div
element = [a.text for ul in _div.find_all("ul") for a in ul.find_all("a")]
The results were:
The HTML file screenshot is as follows: [screenshot: div and ul]
Also, there is JavaScript right after the div I am trying to get the content from.
I also tried getting all the li elements like this:
l = []
for tag in soup.ul.find_all("a", recursive=True):
    l.append(tag.text)
But the text I got was not what I wanted. Is the text under that div hidden by the JavaScript?
Any help is welcome. Thank you very much in advance.
Upvotes: 3
Views: 504
Reputation: 1220
The problem is actually the opposite of what you guessed: the content inside <div id="divRecentlyUpdatedList"> is not hidden by JavaScript, it is filled in by JavaScript after an API call.
requests.get does not execute any JavaScript on the page, so we end up with an empty div. To get the content, you need a library that drives a headless browser so the JavaScript can run, for example requests-html:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
URL = "https://www.nccn.org/professionals/physician_gls/recently_updated.aspx"
session = HTMLSession()
site = session.get(URL)
site.html.render()  # runs the page's JavaScript in headless Chromium (downloaded on first use)
html = site.html.html
soup = BeautifulSoup(html, 'html.parser')
_div=soup.find("div",{"id":"divRecentlyUpdatedList"})
Now _div will contain the rendered content from the API, and you can continue finding the content you want.
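As a rough sketch of that last step, here is one way to pull the link text out of the div once it has been rendered. The helper name link_texts and the sample markup are made up for illustration; the real page's structure may differ, so treat this as a starting point only.

```python
from bs4 import BeautifulSoup


def link_texts(html, div_id="divRecentlyUpdatedList"):
    """Return the stripped text of every <a> inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    if div is None:  # the div is missing from the HTML entirely
        return []
    return [a.get_text(strip=True) for a in div.find_all("a")]


# Hypothetical markup standing in for the rendered page
sample = """
<div id="divRecentlyUpdatedList">
  <ul>
    <li><a href="#">Guideline A</a></li>
    <li><a href="#">Guideline B</a></li>
  </ul>
</div>
"""
# What plain requests would see: the div exists but has no children
empty = '<div id="divRecentlyUpdatedList"></div>'

print(link_texts(sample))
print(link_texts(empty))
```

Returning an empty list for an empty or missing div makes it easy to tell "not rendered yet" apart from a parsing error.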
Upvotes: 1
Reputation: 371138
The content is populated into the HTML asynchronously from the endpoint https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx, which returns JSON. Since it's populated asynchronously via JS, requests doesn't see its results.
You can request that endpoint directly and parse the JSON instead, e.g.:
import json
import requests

page = requests.get("https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx")
data = json.loads(page.content)  # renamed from list to avoid shadowing the built-in
for item in data['recent_guidelines']:
    print(item['Name'], item['VersionNumber'], item['PublishedDate'])
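If you want to try the parsing step without hitting the network, something like the following might help. Only the key names (recent_guidelines, Name, VersionNumber, PublishedDate) come from the answer above; the sample values and the summarize helper are hypothetical, and the real response may contain other fields.

```python
import json

# Hypothetical payload: the values are invented for illustration
sample = """
{"recent_guidelines": [
  {"Name": "Example Guideline", "VersionNumber": "1.2020", "PublishedDate": "01/01/2020"},
  {"Name": "Incomplete Entry"}
]}
"""


def summarize(payload):
    """Return one 'name version date' string per guideline, tolerating missing keys."""
    data = json.loads(payload)
    lines = []
    for item in data.get("recent_guidelines", []):
        fields = (item.get(key, "") for key in ("Name", "VersionNumber", "PublishedDate"))
        lines.append(" ".join(str(field) for field in fields).strip())
    return lines


for line in summarize(sample):
    print(line)
```

Using .get with a default keeps the loop from raising KeyError if an entry lacks one of the fields.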
Upvotes: 2