zeusbella

Reputation: 59

Not getting the entire <li> line using BeautifulSoup

I am using BeautifulSoup to extract the list items under the class "secondary-nav-main-links" from the https://www.champlain.edu/current-students web page. I thought my working code below would print each entire <li> element on one line, but the last portion, </li>, is placed on its own line. I included screen captures of the current output and the intended output. Any ideas? Thanks!!

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(html.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    print(div)

Current output: capture1

Intended output: capture2

Upvotes: 0

Views: 77

Answers (1)

Aven Desta

Reputation: 2443

You can remove the newline characters with str.replace, and you can unescape HTML entities like &amp; with html.unescape.

str(div).replace('\n','')

To replace &amp; with &, wrap the string in html.unescape inside the print statement:

import html
html.unescape(str(div))
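
Here is a small sketch of both steps that runs without fetching the page. The li_markup string is a made-up stand-in for what str(div) might return (the link text and href are my assumptions, not taken from the real site), so treat it as an illustration only:

```python
import html

# Hypothetical stand-in for str(div): an <li> whose closing tag
# ended up on its own line, with an escaped ampersand in the text.
li_markup = '<li><a href="/academics">Academics &amp; Programs</a>\n</li>'

# Step 1: remove the newline so the whole element sits on one line.
one_line = li_markup.replace('\n', '')
print(one_line)

# Step 2: turn HTML entities such as &amp; back into plain characters.
unescaped = html.unescape(one_line)
print(unescaped)
```

Note that html.unescape converts every entity in the string, including &lt; and &gt;, so the result is meant for display rather than for parsing again as HTML.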

So your code becomes

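from urllib.request import urlopen
from bs4 import BeautifulSoup
import html

# Name the response something other than `html` so it does not shadow the html module
response = urlopen('https://www.champlain.edu/current-students')
bs = BeautifulSoup(response.read(), 'html.parser')
soup = bs.find(class_='secondary-nav secondary-nav-sm has-callouts')
for div in soup.find_all('li'):
    # Strip the newlines inside each <li>, then unescape entities such as &amp;
    print(html.unescape(str(div).replace('\n','')))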

Upvotes: 1
