Reputation: 149
I am scraping an article using BeautifulSoup. I want to collect all of the p tags within the article body except those in one particular section. Could someone give me a hint as to what I am doing wrong? I don't get an error, but the output doesn't change: the code still grabs the word "Print" from the unwanted section and writes it out with the other p tags.
Section I want to ignore: soup.find("div", {'class': 'add-this'})
url: http://www.un.org/apps/news/story.asp?NewsID=47549&Cr=burundi&Cr1=#.U0vmB8fTYig
# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Retrieve all of the paragraphs
tags = soup.find("div", {'id': 'fullstory'}).find_all('p')
for tag in tags:
    ptags = soup.find("div", {'class': 'add-this'})
    for tag in ptags:
        txt.write(tag.nextSibling.text.encode('utf-8') + '\n' + '\n')
    else:
        txt.write(tag.text.encode('utf-8') + '\n' + '\n')
Upvotes: 1
Views: 198
Reputation: 473763
One option is to just pass recursive=False in order not to search p tags inside any other elements of the fullstory div:
tags = soup.find("div", {'id': 'fullstory'}).find_all('p', recursive=False)
for tag in tags:
    print tag.text
This grabs only the top-level paragraphs of the div and prints the complete article:
10 April 2014 The United Nations today called on the Government...
...
...follow up with the Government on these concerns.
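To see the effect without fetching the page, here is a minimal sketch on an inline snippet. The HTML below is a hypothetical stand-in that only assumes the "Print" paragraph sits inside the add-this div, nested within fullstory; the real page's markup may differ. It is written for Python 3 with bs4 and the built-in "html.parser". It also shows a possible alternative, removing the unwanted div with decompose(), which may help if the paragraphs you want are not all direct children of fullstory.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
HTML = """
<div id="fullstory">
  <p>First paragraph.</p>
  <div class="add-this"><p>Print</p></div>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
story = soup.find("div", {"id": "fullstory"})

# Default (recursive) search also descends into the nested add-this div.
all_paragraphs = [p.get_text() for p in story.find_all("p")]

# recursive=False looks only at direct children of the fullstory div.
top_level = [p.get_text() for p in story.find_all("p", recursive=False)]

# Alternative: drop the unwanted div from the tree, then search as usual.
unwanted = story.find("div", {"class": "add-this"})
if unwanted is not None:
    unwanted.decompose()
after_removal = [p.get_text() for p in story.find_all("p")]

print(all_paragraphs)  # includes "Print"
print(top_level)       # "Print" is skipped
print(after_removal)   # "Print" is gone from the tree
```

Note that decompose() modifies the parsed tree in place, so it should come after any step that still needs the removed section.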
Upvotes: 1