Reputation: 149
I'm trying to grab all of the p tags within the body of an article. I was wondering if someone could explain why my code was wrong and how I could improve it. Below is the URL of the article and the relevant code. Thanks for any insight you can provide.
url: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/
import urllib2
from bs4 import BeautifulSoup
# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
soup = BeautifulSoup(urllib2.urlopen(url).read())
# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'
Upvotes: 3
Views: 8864
Reputation: 473873
The problem is that there are multiple div tags with class="bd" on the page, and find() returns only the first one it encounters. It looks like you need the one that contains the actual article, which is inside the article tag:
import urllib2
from bs4 import BeautifulSoup
# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")
soup = BeautifulSoup(urllib2.urlopen(url))
# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text
prints:
Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
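To see why the original attempt went wrong without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup is a hypothetical stand-in for the real page, not a copy of it. It is written for Python 3 and passes "html.parser" explicitly, which are my assumptions rather than part of the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two div.bd blocks,
# only one of which sits inside <article>.
HTML = """
<html><body>
  <div class="bd"><p>Sidebar teaser</p></div>
  <article>
    <div class="bd"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </article>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns only the first match, which here is the sidebar block.
first_bd = soup.find("div", {"class": "bd"})
first_texts = [p.get_text() for p in first_bd.find_all("p")]

# get_text() returns a plain string, so looping over it yields
# single characters rather than tags.
chars = list(first_bd.get_text())

# Scoping the search to <article> first selects the intended block.
article_bd = soup.find("article").find("div", {"class": "bd"})
article_texts = [p.get_text() for p in article_bd.find_all("p")]

print(first_texts)    # ['Sidebar teaser']
print(article_texts)  # ['First paragraph.', 'Second paragraph.']
```

In the question's code, the loop ran once per character of that string and called soup.find_all('p') on the whole document each time, which is why every p tag on the page was printed over and over.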
Hope that helps.
Upvotes: 5