Reputation: 51
Trying to teach myself some web scraping, just for fun. Decided to use it to look at a list of jobs posted on a website. I've gotten stuck. I want to be able to pull all the jobs listed on this page, but can't seem to get it to recognize anything deeper in the container I've made. Any suggestions are more than appreciated.
Current Code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myURL = 'https://jobs.collinsaerospace.com/search-jobs/'

# Download the page and close the connection
uClient = uReq(myURL)
page_html = uClient.read()
uClient.close()

# Parse the HTML; find_all returns a ResultSet of matching <section> tags
page_soup = soup(page_html, "html.parser")
container = page_soup.find_all("section", {"id": "search-results-list"})
container
Sample of the container:
<section id="search-results-list">
<ul>
<li>
<a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
<a data-job-id="12394445" href="/job/cedar-rapids/associate-systems-engineer/1738/12394445">
<h2>Associate Systems Engineer</h2>
<span class="job-location">Cedar Rapids, Iowa</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
<li>
I'm trying to understand how to actually extract the h2-level information (or really any information within the container I've created).
Upvotes: 1
Views: 71
Reputation: 12915
If I understand correctly, you're looking to extract the headings from your container. Here's a snippet to do that:
for child in container:
    for heading in child.find_all('h2'):
        print(heading.text)
Note that child and heading are just dummy variables I'm using to iterate through the ResultSet (which is what container is) and the list of headings. For each child, I'm searching for all the h2 tags and printing the text of each one.
If you want to extract something else from your container, just tweak the arguments to find_all.
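For example, if you wanted the job locations instead of the headings, you could point find_all at the span with class job-location. A self-contained sketch using the sample markup from the question (no network request needed):

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = """
<section id="search-results-list">
<ul>
<li>
<a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
<span class="job-date-posted">06/27/2019</span>
</a>
</li>
</ul>
</section>
"""

page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find_all("section", {"id": "search-results-list"})

# Same loop shape as above, but targeting the location spans
locations = []
for child in container:
    for span in child.find_all("span", {"class": "job-location"}):
        locations.append(span.text)

print(locations)  # ['Melbourne, Florida']
```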
Upvotes: 2
Reputation: 578
I have tried to replicate the same result using lxml:
import requests
from lxml import html

resp = requests.get('https://jobs.collinsaerospace.com/search-jobs/')
data_root = html.fromstring(resp.content)

data = []
for node in data_root.xpath('//section[@id="search-results-list"]/ul/li'):
    data.append({
        "url": node.xpath('a/@href')[0],
        "name": node.xpath('a/h2/text()')[0],
        "location": node.xpath('a/span[@class="job-location"]/text()')[0],
        "posted": node.xpath('a/span[@class="job-date-posted"]/text()')[0],
    })
print(data)
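One caveat (my addition, not part of the answer above): each node.xpath(...)[0] raises an IndexError if a listing happens to be missing one of the fields. A defensive sketch with a hypothetical first() helper, run against a reduced copy of the sample markup in which the posted date is absent:

```python
from lxml import html

# Hypothetical helper: return the first xpath match, or a default if none
def first(node, path, default=""):
    matches = node.xpath(path)
    return matches[0] if matches else default

# Reduced sample: note there is no job-date-posted span here
sample = """
<html><body>
<section id="search-results-list"><ul>
<li><a data-job-id="12394447" href="/job/melbourne/test-technician/1738/12394447">
<h2>Test Technician</h2>
<span class="job-location">Melbourne, Florida</span>
</a></li>
</ul></section>
</body></html>
"""

root = html.fromstring(sample)
for node in root.xpath('//section[@id="search-results-list"]/ul/li'):
    job = {
        "url": first(node, 'a/@href'),
        "name": first(node, 'a/h2/text()'),
        "location": first(node, 'a/span[@class="job-location"]/text()'),
        "posted": first(node, 'a/span[@class="job-date-posted"]/text()'),  # missing -> ""
    }
    print(job)
```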
Upvotes: 2