Reputation: 11
I am using this script. It returns the data I want, but all I need is the "Updated" date part. I am trying to get rid of the text that follows it.
# import library
from bs4 import BeautifulSoup
import requests
# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content)
raw=soup.findAll(class_="module-content")[3].text
print(raw.strip())
This is the output I get:
Updated 1-19-2021
There are no views created for this resource yet.
The first line of the output ("Updated 1-19-2021") is what I am trying to get, not the line that follows it.
Upvotes: 1
Views: 124
Reputation: 20128
You can use the find_next() method, which returns the first match that follows the element. With text=True, that match is the first text node:
raw=soup.findAll(class_="module-content")[3].find_next(text=True)
Full example:
from bs4 import BeautifulSoup
import requests
# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, "html.parser")
raw=soup.findAll(class_="module-content")[3].find_next(text=True)
print(raw.strip())
Output:
Updated 1-19-2021
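To see why this works without hitting the network, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the real page (the actual markup may differ), and it uses string=True, which is the newer name for the text=True argument in Beautiful Soup 4.4 and later:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about what the page's
# "module-content" block roughly looks like, not a copy of it.
html = """
<div class="module-content">Updated 1-19-2021
  <p>There are no views created for this resource yet.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find_all(class_="module-content")[0]

# find_next() walks the nodes that come after the opening tag in
# document order; string=True accepts the first text node it meets.
first_text = block.find_next(string=True)
updated = first_text.strip()
print(updated)
```

Because the search stops at the first text node, the text inside the nested element is never reached. Note that if the block started with whitespace followed by a child tag, the first text node would be that whitespace, so this relies on the date being the leading text.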
Upvotes: 1
Reputation: 195643
Try:
import requests
from bs4 import BeautifulSoup
url = "https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
print(soup.select_one(".inner-primary .module-content").contents[0].strip())
Prints:
Updated 1-19-2021
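A slightly more defensive variant of the same idea, sketched here against hypothetical markup (the real page structure is an assumption). The helper name first_child_text is made up for illustration. It guards against select_one() returning None when the selector does not match:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="inner-primary">
  <div class="module-content">Updated 1-19-2021
    <p>There are no views created for this resource yet.</p>
  </div>
</div>
"""


def first_child_text(markup, selector):
    """Return the stripped first child of the first element matching
    selector, or None if nothing matches or the element is empty."""
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.select_one(selector)  # None when the selector has no match
    if node is None or not node.contents:
        return None
    # .contents is the list of direct children; here the first child
    # is the text node that holds the date.
    return str(node.contents[0]).strip()


result = first_child_text(html, ".inner-primary .module-content")
missing = first_child_text(html, ".does-not-exist")
print(result)
```

Checking for None first avoids an AttributeError if the site changes its class names.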
Upvotes: 0