Andy

Reputation: 11

Extract specific data output from beautifulsoup

I am using the script below. It returns the data I want, but all I need is the "Updated date" part. I am trying to get rid of the text that follows it.

# import library
from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, "html.parser")  # name the parser explicitly to avoid a warning
raw=soup.findAll(class_="module-content")[3].text
print(raw.strip())

This is the output I get:

Updated 1-19-2021
      
      
        
          
          

        
          
            

There are no views created for this resource yet.

The first line of the output (Updated 1-19-2021) is what I am trying to get, not the text that follows it.

Upvotes: 1

Views: 124

Answers (2)

MendelG

Reputation: 20128

You can use the find_next() method, which returns the first match that comes after the element. With text=True it matches the first text node:

raw=soup.findAll(class_="module-content")[3].find_next(text=True)

Full example:

from bs4 import BeautifulSoup
import requests

# Request to website and download HTML contents
url='https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, "html.parser")
raw=soup.findAll(class_="module-content")[3].find_next(text=True)
print(raw.strip())

Output:

Updated 1-19-2021
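
To see why this works without hitting the live site, here is a minimal offline sketch. The HTML snippet is hypothetical, only shaped like the output in the question, and it uses string=True, which is the newer name for the text argument in recent Beautiful Soup versions:

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): a leading text node
# followed by nested elements that contribute the unwanted text.
html = """
<div class="module-content">Updated 1-19-2021
  <div class="views">
    <p>There are no views created for this resource yet.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find(class_="module-content")

# .text joins every descendant string, so the trailing message comes along.
print(repr(block.text.strip()))

# find_next(string=True) returns only the first text node after the opening tag.
first = block.find_next(string=True).strip()
print(first)
```

The first print shows both pieces of text joined together, while the second shows only the leading "Updated ..." string.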

Upvotes: 1

Andrej Kesely

Reputation: 195643

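Try selecting the element with a CSS selector and taking its first child, which is the text node you want: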

import requests
from bs4 import BeautifulSoup

url = "https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

print(soup.select_one(".inner-primary .module-content").contents[0].strip())

Prints:

Updated 1-19-2021
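
If you need the date itself rather than the whole label, the extracted string can be parsed further. This is only a sketch: parse_updated is a hypothetical helper name, and it assumes the label always uses the month-day-year order seen in the output above.

```python
import re
from datetime import date, datetime


def parse_updated(label: str) -> date:
    """Extract the date from a label like 'Updated 1-19-2021' (assumed M-D-YYYY)."""
    match = re.search(r"\d{1,2}-\d{1,2}-\d{4}", label)
    if match is None:
        raise ValueError(f"no date found in {label!r}")
    # strptime accepts one- or two-digit months and days for %m and %d.
    return datetime.strptime(match.group(0), "%m-%d-%Y").date()


print(parse_updated("Updated 1-19-2021"))  # 2021-01-19
```

Raising ValueError when nothing matches makes a changed page layout fail loudly rather than silently returning a wrong value.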

Upvotes: 0
