Reputation: 9348
How can I have BeautifulSoup only consider a certain part of contents of a webpage?
For example, I want to pick up all div tags that appear only after ‘Most viewed right now’ on the page http://www.dailypress.com/.
My code so far:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.dailypress.com/'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
and I can use:
str(soup).find('Most viewed right now')
to locate the sentence, but that only gives me a character offset and does not help me isolate the part of the contents I want.
Upvotes: 1
Views: 1444
Reputation: 473863
Find the div that contains the most viewed articles, then find all links inside it:
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> import re
>>> url = "http://www.dailypress.com"
>>> soup = BeautifulSoup(urllib2.urlopen(url))
>>> most_viewed = soup.find('div', class_=re.compile('mostViewed'))
>>> for item in most_viewed.find_all('a'):
...     print item.text.strip()
...
Body of driver recovered from Chesapeake Bay Bridge-Tunnel wreck
Hampton police looking for man linked to Friday's fatal apartment shooting
Police identify suspect in Saturday's fatal shooting in Hampton
Teen spice user: 'It's the new crack'
When spice came to Gloucester
The trick here is that we first find the container for the Most Viewed links: it is a div that has the mostViewed class. You can inspect it with the help of browser developer tools.
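To try the idea without hitting the network, here is a minimal sketch that runs against an inline HTML string. The markup, the class names other than mostViewed, and the variable names are made up for illustration; the real page's structure may differ. It also shows the "everything after the heading" approach from the question, using find_all_next, for comparison. The string= argument assumes a reasonably recent bs4 (older releases use text= instead).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual class
# names and nesting on dailypress.com may differ.
HTML = """
<div class="header"><a href="/home">Home</a></div>
<div class="module mostViewed">
  <h3>Most viewed right now</h3>
  <ul>
    <li><a href="/a1"> First story </a></li>
    <li><a href="/a2">Second story</a></li>
  </ul>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Approach 1: locate the container by a class name containing "mostViewed",
# then search only inside it.
container = soup.find("div", class_=re.compile("mostViewed"))
titles = []
if container is not None:
    titles = [a.get_text(strip=True) for a in container.find_all("a")]

# Approach 2: locate the heading text, then collect every div that starts
# after it in document order.
heading = soup.find(string=re.compile("Most viewed right now"))
later_divs = heading.find_all_next("div") if heading is not None else []
later_classes = [d.get("class") for d in later_divs]

print(titles)
print(later_classes)
```

In this sample, the second approach returns only the footer div: the mostViewed container opens before the heading, so it is not "after" it. That is why scoping the search to the container is usually the more reliable choice.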
Upvotes: 1