Reputation: 35
I am trying to grab the universities that attorneys at a particular law firm attended, but I am unsure how to grab both universities listed on this page: https://www.wlrk.com/attorney/hahn/. As seen in the first linked image, the two universities this attorney attended are under two separate 'li' tags.
When I run the following code, I only get the HTML up to the end of the first 'li' tag (as seen in the second linked image) and not the second 'li' section, so I only get the first university, "Carleton College":
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "html.parser")
education = personal_soup.find("div",{'class':'attorney--education'})
education.li.a.text # 'Carleton College'
Upvotes: 1
Views: 63
Reputation: 84465
Change your parser, and I would use select and target the a elements directly. 'lxml' is more forgiving and will handle the stray closing a tags, which shouldn't be there. Also, find would only ever have returned the first match, whereas find_all returns all matches.
e.g. the stray end tags reported for the page's markup:
<a href="/attorneys/?asf_ugs=257">Carleton College</a></a>
Stray end tag a.
From line 231, column 127; to line 231, column 130
ollege</a></a>, 2013
Stray end tag a.
From line 231, column 239; to line 231, column 242
of Law</a></a>, J.D.
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
personal_soup = soup(res.content, "lxml")
educations = [a.text for a in personal_soup.select('.attorney--education a')]
print(educations)
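To try this without hitting the site, here is a minimal offline sketch. The HTML string is a hand-written stand-in modeled on the "Stray end tag" excerpts above, not the real page, and the second href is a placeholder. It assumes both bs4 and lxml are installed, and contrasts find (first match only) with select (all matches).

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the education block (an assumption, not the live page).
# The doubled </a> mimics the stray end tags quoted above; the second href is a placeholder.
SAMPLE_HTML = """
<div class="attorney--education">
  <ul>
    <li><a href="/attorneys/?asf_ugs=257">Carleton College</a></a>, 2013</li>
    <li><a href="#">New York University School of Law</a></a>, J.D.</li>
  </ul>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "lxml")

# find() stops at the first matching element.
first_link = sample_soup.find("div", {"class": "attorney--education"}).find("a")
first_school = first_link.text

# select() returns every element matching the CSS selector.
schools = [a.text for a in sample_soup.select(".attorney--education a")]

print(first_school)
print(schools)
```

If the real markup differs from this stand-in, the selector may need adjusting, so check it against the actual response first.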
Upvotes: 1
Reputation: 1938
BeautifulSoup is fetching only the first li element, and I am not sure why. If you want to try using lxml directly, here is one way:
import requests
from lxml import html
url = 'https://www.wlrk.com/attorney/hahn/'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'})
tree = html.fromstring(res.content)
education = tree.xpath("//div[@class='attorney--education']//li/a/text()")
print(education)
output:
['Carleton College', 'New York University School of Law']
Upvotes: 0