Extracting Text Using BeautifulSoup

Question

I am trying to extract text from an older webpage and am having trouble. In the page source (http://www.presidency.ucsb.edu/ws/index.php?pid=119039), the text begins:

> PARTICIPANTS:
> Former Secretary of State Hillary Clinton (D) and
> Businessman Donald Trump (R)
> MODERATOR:
> Chris Wallace (Fox News)
> WALLACE:
> Good evening from the Thomas and Mack Center at the University of
> Nevada, Las Vegas. I'm Chris Wallace of Fox News, and I welcome you to
> the third and final of the 2016 presidential debates between Secretary
> of State Hillary Clinton and Donald J. Trump.

I have tried extracting the text using:

import requests
from bs4 import BeautifulSoup

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=119039"
debate_response = requests.get(link)
debate_soup = BeautifulSoup(debate_response.content, 'html.parser')
debate_text = debate_soup.find_all('div',{'span class':"displaytext"})
print(debate_text)

but this just returns an empty list. Any idea how I can extract the text?

Jonathan · Accepted Answer

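Your query returns an empty list because it asks for a div that has an attribute literally named 'span class'. No such attribute exists; the text sits inside a span tag whose class is displaytext. I also had to use lxml as the parser, because html.parser raised a maximum recursion depth error on this page. The following extracts all text from the span's children into one string: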

debate_soup = BeautifulSoup(debate_response.content, 'lxml')
debate_text = debate_soup.find('span', {'class': 'displaytext'}).get_text()
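
To see the difference between the two queries without hitting the network, here is a minimal sketch run against a small inline snippet. SAMPLE_HTML is a made-up stand-in for the real page (its exact markup is an assumption), and the sketch uses html.parser only so it runs without lxml installed; for the real page, stick with lxml as above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: the transcript is assumed to live
# inside <span class="displaytext">, as the answer's query implies.
SAMPLE_HTML = """
<html><body>
<div><span class="displaytext"><b>PARTICIPANTS:</b><br>
Former Secretary of State Hillary Clinton (D)
<p><b>WALLACE:</b> Good evening.</p></span></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The question's query: a <div> with an attribute named "span class".
# No tag has such an attribute, so nothing matches.
wrong = soup.find_all("div", {"span class": "displaytext"})

# The answer's query: a <span> whose class is "displaytext".
span = soup.find("span", {"class": "displaytext"})

# find() returns None when nothing matches, so guard before calling get_text().
# separator/strip keep each text node on its own line without stray whitespace.
text = span.get_text(separator="\n", strip=True) if span is not None else ""

print(len(wrong))
print(text)
```

Passing separator="\n" is optional; without it, get_text() joins the text of adjacent tags with nothing in between, which can run speaker labels into the following sentence.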
