Extract specific data output from beautifulsoup

Question

I am using the script below. It returns the data I want, but I only need the "Updated date" part. I am trying to get rid of the text that follows it.

# import library
from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content)
raw=soup.findAll(class_="module-content")[3].text
print(raw.strip())

This is the output I get:

Updated 1-19-2021
      
      
        
          
          

        
          
            

There are no views created for this resource yet.

The first line of the output (Updated 1-19-2021) is what I am trying to get, not the rest.

MendelG · Accepted Answer

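You can use the find_next() method, which returns the first match that comes after the element in the document. With text=True it matches the first text node (Beautiful Soup 4.4.0 and later also accept this argument under the name string):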

raw=soup.findAll(class_="module-content")[3].find_next(text=True)
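
To see what find_next() picks up without hitting the live site, here is a minimal sketch on a made-up HTML snippet. SAMPLE_HTML and the helper name first_text are assumptions for illustration only; the real page's markup may differ. It uses string=True, the newer name for text=True.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one "module-content" block on the page.
SAMPLE_HTML = """
<div class="module-content">Updated 1-19-2021
  <div class="inner">
    <p>There are no views created for this resource yet.</p>
  </div>
</div>
"""

def first_text(html):
    """Return the first text node that follows the module-content tag."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(class_="module-content")
    # string=True matches any text node; find_next returns the first one
    # after the opening tag, so nested text further down is never reached.
    node = block.find_next(string=True)
    return node.strip()

print(first_text(SAMPLE_HTML))
```

Because only the first text node is returned, the whitespace and the "no views" message from the nested tags are left out.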

Full example:

from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, "html.parser")
raw=soup.findAll(class_="module-content")[3].find_next(text=True)
print(raw.strip())

Output:

Updated 1-19-2021
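
As an alternative (not the approach used above), the stripped_strings generator yields each non-empty text fragment of a tag with whitespace already removed, so taking its first item should give the same result. This is a sketch on a made-up snippet; SAMPLE_HTML and first_stripped_string are hypothetical names, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one "module-content" block on the page.
SAMPLE_HTML = """
<div class="module-content">Updated 1-19-2021
  <div class="inner">
    <p>There are no views created for this resource yet.</p>
  </div>
</div>
"""

def first_stripped_string(html):
    """Return the first non-empty, whitespace-stripped string in the block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(class_="module-content")
    # stripped_strings skips whitespace-only fragments; the default of None
    # covers the case where the block contains no text at all.
    return next(block.stripped_strings, None)

print(first_stripped_string(SAMPLE_HTML))
```

Unlike find_next(), this only looks inside the tag, so it returns None rather than text from a later element when the block is empty.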
