Using BeautifulSoup only consider a certain part of contents of a webpage

Question

How can I make BeautifulSoup consider only a certain part of a webpage's contents?

For example, I want to pick up only the div tags that come after ‘Most viewed right now’ on the page http://www.dailypress.com/.

My code so far:

from bs4 import BeautifulSoup
import urllib2

url = 'http://www.dailypress.com/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

and I can use:

str(soup).find('Most viewed right now')

to locate the phrase, but that only gives me a character offset in the raw string, which doesn't help me select the part of the contents I want.

alecxe · Accepted Answer

Find the div that contains the most viewed articles and find all links inside:

>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> import re
>>> url = "http://www.dailypress.com"
>>> soup = BeautifulSoup(urllib2.urlopen(url))
>>> most_viewed = soup.find('div', class_=re.compile('mostViewed'))
>>> for item in most_viewed.find_all('a'):
...     print item.text.strip()
... 
Body of driver recovered from Chesapeake Bay Bridge-Tunnel wreck
Hampton police looking for man linked to Friday's fatal apartment shooting
Police identify suspect in Saturday's fatal shooting in Hampton
Teen spice user: 'It's the new crack'
When spice came to Gloucester

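The trick is to first find the container for the Most Viewed links: a div whose class contains mostViewed. You can find it by inspecting the page with your browser's developer tools. Every search called on that container, such as find_all('a'), then looks only at its descendants and ignores the rest of the page.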
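To make the scoping idea concrete without depending on the live site, here is a minimal sketch written for Python 3 that parses a small inline document. The markup is made up for illustration; only the mostViewed class name and the 'Most viewed right now' phrase come from this thread, and the real page's structure may differ. It also shows a second option that is closer to the question's wording: locating the heading text and collecting the div tags that follow it with find_all_next.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page. Only the 'mostViewed'
# class name and the heading phrase are taken from the thread above.
HTML = """
<div class="header"><a href="/home">Home</a></div>
<div class="module mostViewed">
  <h3>Most viewed right now</h3>
  <ul>
    <li><a href="/a1"> First story </a></li>
    <li><a href="/a2">Second story</a></li>
  </ul>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Approach 1: find the container, then search only inside it.
container = soup.find("div", class_=re.compile("mostViewed"))
scoped_titles = [a.get_text(strip=True) for a in container.find_all("a")]

# Approach 2: find the heading text, then collect every div that
# appears after it in document order.
heading = soup.find(string=re.compile("Most viewed right now"))
divs_after = heading.find_all_next("div")

print(scoped_titles)
print([div["class"] for div in divs_after])
```

The first approach is usually the more robust one, since it depends on the page's structure rather than on visible text that editors may reword. The second is handy when the section has no distinguishing class or id to search for.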
