Confusion with BeautifulSoup.find?

Question

I am trying to scrape the universities that attorneys at a particular law firm attended, but I am unsure how to grab both universities listed on this page: https://www.wlrk.com/attorney/hahn/. As the first linked image shows, the two universities this attorney attended sit under two separate 'li' tags.

When I run the following code, I only get the HTML up to the end of the first 'li' tag (as seen in the second linked image) and not the second 'li' section, so I only get the first university, "Carleton College":

import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "html.parser")
education = personal_soup.find("div",{'class':'attorney--education'})
education.li.a.text # 'Carleton College'

[Linked images: HTML code snippet; output]

QHarr · Accepted Answer

Change your parser, and I would use select to target the a elements directly. 'lxml' is more forgiving and will handle the stray closing a tags, which shouldn't be there. Also, find only ever returns the first match, whereas find_all returns all matches.

e.g.

Carleton College

Stray end tag a.
From line 231, column 127; to line 231, column 130
ollege, 2013

Stray end tag a.
From line 231, column 239; to line 231, column 242
of Law, J.D.

source

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
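# The lxml parser must be installed separately (pip install lxml)
personal_soup = soup(res.content, "lxml")
# select returns every matching element, so all a tags in the block are collected
educations = [a.text for a in personal_soup.select('.attorney--education a')]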
print(educations)
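
To see the find versus find_all/select difference without hitting the live site, here is a minimal sketch on an inline snippet. The markup is a simplified, well-formed stand-in for the real page (an assumption, not the site's actual HTML), and "Example School of Law" is a placeholder name, since the second school's full name is not shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the attorney page.
# "Example School of Law" is a placeholder, not the real entry.
HTML = """
<div class="attorney--education">
  <ul>
    <li><a href="#">Carleton College</a>, 2013</li>
    <li><a href="#">Example School of Law</a>, J.D.</li>
  </ul>
</div>
"""

page = BeautifulSoup(HTML, "html.parser")
education = page.find("div", {"class": "attorney--education"})

# Attribute access (education.li.a) behaves like find: first match only.
first_only = education.li.a.text

# find_all and select both return every match.
all_via_find_all = [a.text for a in education.find_all("a")]
all_via_select = [a.text for a in page.select(".attorney--education a")]

print(first_only)
print(all_via_find_all)
print(all_via_select)
```

On the real page the stray closing tags are a separate problem, which is why the parser change above is still needed there.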
