jshusko

Reputation: 35

Confusion with BeautifulSoup.find?

I am trying to grab the universities that attorneys at a particular law firm attended, but I am unsure how to grab both universities listed on this page: https://www.wlrk.com/attorney/hahn/. As seen in the first linked image, the two universities this attorney attended are under two separate 'li' tags.

When I run the following code, I only get the HTML up to the end of the first 'li' tag (as seen in the second linked image), but not the second 'li' section, so I only get the first university, "Carleton College":

import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "html.parser")    
education = personal_soup.find("div",{'class':'attorney--education'})
education.li.a.text  # only the first university: 'Carleton College'


Upvotes: 1

Views: 63

Answers (2)

QHarr

Reputation: 84465

Change your parser, and I would use select to target the a elements directly. 'lxml' is more forgiving and will handle the stray closing a tags, which shouldn't be there. Also, find only ever returns the first match, whereas find_all returns all matches.

e.g.

<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>

Stray end tag a.

From line 231, column 127; to line 231, column 130

ollege</a></a>, 2013

Stray end tag a.

From line 231, column 239; to line 231, column 242

of Law</a></a>, J.D.

source

import requests
from bs4 import BeautifulSoup as soup

url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "lxml")    
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
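
To show the find versus find_all point in isolation, here is a minimal sketch that runs without a network request. The HTML string is a hypothetical, well-formed stand-in for the education block (the second href is a placeholder), so it does not reproduce the stray closing tags on the real page; it only illustrates which calls return one match and which return all of them.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the education block; not the real page markup.
# The second href is a placeholder, since the real one is not shown above.
html_doc = """
<div class="attorney--education">
  <ul>
    <li><a href="/attorneys/?asf_ugs=257">Carleton College</a>, 2013</li>
    <li><a href="#">New York University School of Law</a>, J.D.</li>
  </ul>
</div>
"""

doc = BeautifulSoup(html_doc, "html.parser")
education = doc.find("div", {"class": "attorney--education"})

# Attribute access (education.li) is shorthand for find("li"): first match only.
first_only = education.li.a.text

# find_all and select both return every match.
via_find_all = [li.a.text for li in education.find_all("li")]
via_select = [a.text for a in doc.select(".attorney--education a")]

print(first_only)
print(via_find_all)
print(via_select)
```

On well-formed markup like this, find_all and select give the same list, so the choice between them is mostly a matter of taste. On the real page the parser change is still what matters.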

Upvotes: 1

Sureshmani Kalirajan

Reputation: 1938

BeautifulSoup is fetching only the first li element; I am not sure why. If you want to try lxml directly, here is one way:

import requests
from lxml import html


url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})

tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")

print(education)

output:

['Carleton College', 'New York University School of Law']

Upvotes: 0
