user3285763

Reputation: 149

Scraping using Beautiful Soup

I am scraping an article using BeautifulSoup. I want to scrape all of the p tags within the article body except those in one section. Could someone give me a hint as to what I am doing wrong? I don't get an error, but the output doesn't change: the code still grabs the word "Print" from the unwanted section and writes it out with the other p tags.

Section I want to ignore: soup.find("div", {'class': 'add-this'})

    url = "http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig"

    # Parse HTML of article, aka making soup
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # Retrieve all of the paragraphs
    tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
    for tag in tags:
        ptags = soup.find("div", {'class': 'add-this'})
        for tag in ptags:
            txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
        else:
            txt.write(tag.text.encode('utf-8') + '\n' + '\n')

Upvotes: 1

Views: 198

Answers (1)

alecxe

Reputation: 473763

One option is to pass recursive=False so that find_all() only considers p tags that are direct children of the fullstory div and skips any nested inside other elements:

tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
for tag in tags:
    print tag.text

This grabs only the top-level paragraphs from the div and prints the complete article:

10 April 2014  The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
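
To see the effect without hitting the network, here is a minimal sketch run against a small inline document. The SAMPLE_HTML markup and both helper names are made up for illustration, and the real page's structure may differ. The second helper is an alternative that is not part of the answer above: it removes the add-this div with decompose() before collecting paragraphs, which may help if some wanted p tags are nested deeper than the top level.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article page (an assumption,
# not copied from the real site).
SAMPLE_HTML = """
<div id="fullstory">
  <div class="add-this"><p>Print</p></div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""


def top_level_paragraphs(html):
    """Return the text of <p> tags that are direct children of div#fullstory."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", {"id": "fullstory"})
    # recursive=False limits the search to direct children of the div
    return [p.get_text() for p in story.find_all("p", recursive=False)]


def paragraphs_without_widget(html):
    """Drop div.add-this from the tree, then collect every remaining <p>."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", {"id": "fullstory"})
    widget = story.find("div", {"class": "add-this"})
    if widget is not None:
        widget.decompose()  # removes the tag and its contents from the tree
    return [p.get_text() for p in story.find_all("p")]


print(top_level_paragraphs(SAMPLE_HTML))
print(paragraphs_without_widget(SAMPLE_HTML))
```

On this sample both helpers skip the "Print" paragraph. They differ only when a wanted p tag sits inside another element: recursive=False would miss it, while the decompose() variant would still pick it up.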

Upvotes: 1
