Reputation: 87
import requests
from bs4 import BeautifulSoup

def getPage(url):
    try:
        req = requests.get(url)
    except requests.exceptions.RequestException:
        return None
    return BeautifulSoup(req.text, 'html.parser')

bs = getPage('https://www.oreilly.com/pub/e/3094')
bs.select('#contained div')
which outputs
[<div itemprop="description">
<h1 class="thankyou-hide" style="max-width:100%; font-size: 1.875em; line-height: 1.6em; margin: 30px 0 0px 0; color: #232323; font-family: 'guardian-text-oreilly',Helvetica,sans-serif; -webkit-font-smoothing: antialiased; -moz-osx-font-smoothing: grayscale; letter-spacing: -.01em; font-weight: 200;">Description:</h1>
<p>
Thanks to the growth of the Python scientific community, Python now has access to a fast and reliable set of high-performance libraries. This, combined with the elegance and power of the language, makes Python an irresistible choice for performance-critical applications.
</p>
<p>
In this webcast you will:
</p>
<ul>
<li>Learn the best tips and tricks to get the most out of the NumPy library</li>
<li>Upgrade your applications' performances by using parallel processing</li>
</ul>
<h3>About Gabriele Lanaro</h3>
<p>
Gabriele Lanaro is a PhD candidate at the University of British Columbia, in the field of molecular simulation. He writes high-performance Python code to analyze chemical systems in large-scale simulations. He created Chemlab — a high performance visualization software in Python—and emacs-for-python—a collection of Emacs extensions that facilitate working with Python code in the Emacs text editor.
</p>
</div>]
I want to use the .select() method to return a list that includes both the p and li elements. So instead of just bs.select('#contained div p'), I want something like bs.select('#contained div p & li'). Any suggestions?
Alternatively, is it possible to select everything between the h1 and the h3 instead?
Upvotes: 0
Views: 31
Reputation: 1415
BeautifulSoup.select() accepts standard CSS selectors, including comma-separated selector lists. So the following should give you all of the <p> and <li> elements:
bs.select('#contained div p, #contained div li')
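To try the comma-separated selector without hitting the network, here is a minimal sketch. The HTML string is a hypothetical, trimmed-down stand-in for the page in the question, and the variable names (html, soup, matches, names) are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: an outer #contained element
# wrapping the description div, as in the question's output.
html = """
<div id="contained">
  <div itemprop="description">
    <h1>Description:</h1>
    <p>First paragraph.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
    <h3>About the speaker</h3>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# The comma joins two independent selectors; an element is returned
# if it matches either one.
matches = soup.select('#contained div p, #contained div li')

names = [tag.name for tag in matches]
texts = [tag.get_text(strip=True) for tag in matches]
print(names)
print(texts)
```

Each side of the comma has to be a full selector, which is why '#contained div' is repeated rather than written once.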
If you want to select the elements between the h1 and the h3 specifically, that's a little more complex:
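h1 = bs.select_one('#contained div h1')
h3 = bs.select_one('#contained div h3')
# Every element inside the div, in document order
result_set = bs.select('#contained div *')
# Keep only the elements after the h1 and before the h3
result_set[result_set.index(h1)+1:result_set.index(h3)]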
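Note that the slice above also contains nested elements (each <li> shows up alongside its parent <ul>), because '#contained div *' matches every descendant. If only the top-level elements between the two headings are wanted, one possible alternative is to walk the h1's siblings instead. This is a sketch on a hypothetical stand-in for the page, not something tested against the real URL, and the names html, soup and between are made up for illustration:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the real page in the question.
html = """
<div id="contained">
  <div itemprop="description">
    <h1>Description:</h1>
    <p>First paragraph.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
    <h3>About the speaker</h3>
    <p>Second paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.select_one('#contained div h1')

between = []
for sibling in h1.next_siblings:
    # next_siblings also yields whitespace text nodes; skip those.
    if not isinstance(sibling, Tag):
        continue
    # Stop at the first h3 that follows the h1.
    if sibling.name == 'h3':
        break
    between.append(sibling)

between_names = [tag.name for tag in between]
print(between_names)
```

Here the <ul> is returned once as a whole, and its <li> children are reachable through it rather than listed separately.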
Upvotes: 1