Beautiful Soup article scraping

Question

I'm trying to grab all of the p tags within the body of an article. I was wondering if someone could explain why my code was wrong and how I could improve it. Below is the URL of the article and the relevant code. Thanks for any insight you can provide.

url: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'

alecxe · Accepted Answer

The problem is that the page contains multiple div tags with class="bd", and find() returns only the first one. It looks like you need the one that holds the actual article text, which sits inside the article tag:

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url))

# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text

prints:

Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
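
To see why the original lookup goes wrong without hitting the live site, here is a minimal sketch in Python 3 run against inline HTML. The markup in SAMPLE_HTML and the helper name article_paragraphs are made up for illustration and are not taken from the real page; they only mimic the layout described above (one div.bd outside the article, one inside). Note also that in the question's loop, body is a plain string returned by get_text(), so `for tag in body` walks over single characters and prints every p on the page once per character.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page): two div.bd
# blocks, only one of which sits inside <article>.
SAMPLE_HTML = """
<html><body>
  <div class="bd"><p>Sidebar teaser</p></div>
  <article>
    <div class="bd">
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
  </article>
</body></html>
"""


def article_paragraphs(html):
    """Return the text of each <p> inside the article's div.bd."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:
        return []  # no <article> tag: nothing to extract
    body = article.find("div", class_="bd")
    if body is None:
        return []  # article has no div.bd
    return [p.get_text(strip=True) for p in body.find_all("p")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match in document order, which in this
# sample is the sidebar block, not the story.
first_bd = soup.find("div", class_="bd")
print(first_bd.get_text(strip=True))

# Scoping the search to <article> first picks the right block.
for text in article_paragraphs(SAMPLE_HTML):
    print(text)
```

Returning an empty list when either tag is missing is just one choice for the sketch; it avoids the AttributeError you would get from chaining .find() calls on None if the page layout changes.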

Hope that helps.
