Reputation: 197
I wrote this test code which uses BeautifulSoup.
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Print the text of every <p> element on the page
for n in soup.find_all('p'):
    print(n.get_text())
It works fine, but it also retrieves text that is not part of the news article, such as the time it was posted, the number of comments, copyright notices, etc.
I would like it to retrieve only the text of the news article itself. How would one go about this?
Upvotes: 2
Views: 806
Reputation: 185219
Be more specific: you need to target the div whose itemprop attribute is articleBody, like so:
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('div', attrs={'itemprop': "articleBody"}):
    print(n.get_text())
Answers on SO are not just for you, but also for people arriving from Google searches and the like. As you can see, attrs is a dict, so you can pass more attribute/value pairs if needed.
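To illustrate passing more than one attribute/value pair, here is a minimal sketch. The markup in SAMPLE_HTML and the class names teaser and main are made up for the example and are not taken from the Daily Mail page; html.parser is used only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real article page.
SAMPLE_HTML = """
<div itemprop="articleBody" class="teaser"><p>Teaser text.</p></div>
<div itemprop="articleBody" class="main"><p>Full article text.</p></div>
<div class="main"><p>Footer text.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A tag matches only if every key/value pair in attrs matches.
matches = soup.find_all("div", attrs={"itemprop": "articleBody", "class": "main"})
texts = [div.get_text(strip=True) for div in matches]
print(texts)
```

Only the second div satisfies both conditions, so the other two are skipped.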
Upvotes: 1
Reputation: 473893
You might have much better luck with the newspaper library, which is focused on scraping articles.
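As a rough sketch of that route, assuming the package is installed (it is published on PyPI as newspaper3k) and the page is reachable. The helper name extract_article_text is made up for this example; how well the extraction works on any given site is not guaranteed.

```python
def extract_article_text(url):
    """Return the main article text for url using the newspaper library."""
    # Imported here so the function can be defined even where the
    # third-party package is not installed.
    from newspaper import Article

    article = Article(url)
    article.download()  # fetch the page HTML
    article.parse()     # extract title, body text, and other metadata
    return article.text
```

Calling extract_article_text(url) with the article URL would then return the body text as a single string.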
If we talk about BeautifulSoup only, one option to get closer to the desired result and have more relevant paragraphs is to find them in the context of the div element with the itemprop="articleBody" attribute:
article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
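Note that soup.find returns None when nothing matches, so on a page without that attribute the loop above would raise an AttributeError. A small self-contained sketch with a guard follows; the sample markup and the helper name article_paragraphs are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one paragraph before and one after the article body.
SAMPLE_HTML = """
<p>Published: 10:00</p>
<div itemprop="articleBody">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>42 comments</p>
"""

def article_paragraphs(html):
    """Return the text of each <p> inside the itemprop="articleBody" element."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(itemprop="articleBody")
    if body is None:
        # No article container on this page, so there is nothing to extract.
        return []
    return [p.get_text(strip=True) for p in body.find_all("p")]

paragraphs = article_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

The paragraphs outside the container (the timestamp and the comment count) are left out because the search for p tags starts from the article element rather than from the whole document.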
Upvotes: 2
Reputation: 6052
You'll need to target something more specific than just the p tag. Try looking for a div with class="article" or something similar, then only grab paragraphs from there.
Upvotes: 1