nuynuy

Reputation: 31

Can't find text in li under div using BeautifulSoup

I am trying to use BeautifulSoup to get the text of the ul inside a div on this website: https://www.nccn.org/professionals/physician_gls/recently_updated.aspx

But I only get an empty div. My code was:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.nccn.org/professionals/physician_gls/recently_updated.aspx")
soup = BeautifulSoup(page.content, "html.parser")
_div = soup.find("div", {"id": "divRecentlyUpdatedList"})
# Collect the text of every link inside the div
element = [a.text for a in _div.find_all("a")]

The results were:

A screenshot of the HTML showed the div and the ul.

Also, there is JavaScript right after the div I am trying to get the content from (screenshot: div and javascript).

I also tried to get all the li elements like this:

l = []
for tag in soup.ul.find_all("a", recursive=True): 
    l.append(tag.text)

But the text I got was not what I wanted. Is the text under that div hidden by the JavaScript?

Any help is welcome. Thank you very much in advance.

Upvotes: 3

Views: 504

Answers (2)

Hamatti

Reputation: 1220

The problem is actually the opposite of what you guessed: the content inside <div id="divRecentlyUpdatedList"> is not hidden by JavaScript but filled in by JavaScript after an API call.

requests.get does not execute any JavaScript on the page, so we end up with an empty div. To get the rendered content, you need a library that drives a headless browser so that the JavaScript is executed, for example requests-html:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

URL = "https://www.nccn.org/professionals/physician_gls/recently_updated.aspx"

session = HTMLSession()
site = session.get(URL)
site.html.render()

html = site.html.html

soup = BeautifulSoup(html, "html.parser")

_div = soup.find("div", {"id": "divRecentlyUpdatedList"})

Now in _div, you will have the rendered content from the API and you can continue finding the content you wish.
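
As a rough sketch of that last step, the link text could be collected along these lines. The SAMPLE_HTML string is a made-up stand-in for the rendered page (the real markup may well differ), and link_texts is a hypothetical helper name, not part of either library:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the rendered page; the real structure may differ.
SAMPLE_HTML = """
<div id="divRecentlyUpdatedList">
  <ul>
    <li><a href="/a">Guideline A</a></li>
    <li><a href="/b">Guideline B</a></li>
  </ul>
</div>
"""


def link_texts(html, div_id="divRecentlyUpdatedList"):
    """Return the stripped text of every <a> inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    if div is None:
        # The div is missing entirely, so there is nothing to extract.
        return []
    # find_all returns an empty list when the div has no links (e.g. unrendered page).
    return [a.get_text(strip=True) for a in div.find_all("a")]


print(link_texts(SAMPLE_HTML))
```

With the rendered page you would pass site.html.html instead of SAMPLE_HTML. An empty list back is a hint that the JavaScript has not run yet.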

Upvotes: 1

CertainPerformance

Reputation: 371138

The content is populated into the HTML asynchronously from the endpoint https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx, which returns JSON. Since it's populated asynchronously and via JS, requests doesn't see its results.

You can request that endpoint directly and parse the JSON instead, e.g.:

import requests

page = requests.get("https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx")
data = page.json()  # parse the JSON body (avoids shadowing the built-in list)
for item in data['recent_guidelines']:
    print(item['Name'], item['VersionNumber'], item['PublishedDate'])
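
If you want to try the parsing step without hitting the network, something like the following might help. It assumes the payload has the shape used above (a recent_guidelines list whose items carry Name, VersionNumber and PublishedDate). The sample values are invented for illustration and summarize is a hypothetical helper:

```python
import json

# Invented sample payload; only the key names are taken from the snippet above.
SAMPLE_JSON = """
{"recent_guidelines": [
    {"Name": "Example Guideline", "VersionNumber": "1.2020", "PublishedDate": "2020-01-01"}
]}
"""


def summarize(payload):
    """Return (name, version, published date) tuples from the parsed JSON."""
    rows = []
    # .get keeps this from raising if the endpoint changes shape or omits a field.
    for item in payload.get("recent_guidelines", []):
        rows.append((item.get("Name"), item.get("VersionNumber"), item.get("PublishedDate")))
    return rows


for row in summarize(json.loads(SAMPLE_JSON)):
    print(*row)
```

For the live endpoint you would pass the result of page.json() to summarize instead of the sample.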

Upvotes: 2
