Reputation: 435
I have a simple Python program that searches for a keyword in the content at a URL and returns True or False. I want to modify it so that it searches only inside the article: not the title, not other parts of the page, not ads, not other articles, and so on. I have hundreds of URLs to check, and they don't share the same layout (I haven't checked them all, but it seems obvious). How can I do this, if it's even possible? This is my first time using BeautifulSoup.
Here is what I use right now:
import re
import sys
from BeautifulSoup import BeautifulSoup
import urllib2
#argecho.py
content = urllib2.urlopen(sys.argv[1]).read()
print sys.argv[2] in content # -> True
I pass the URL and keyword as arguments because another script calls this one for hundreds of URLs.
Upvotes: 1
Views: 2492
Reputation: 1122292
With BeautifulSoup you can restrict the search to the body text by converting sys.argv[2] into a regular expression:
import sys
from bs4 import BeautifulSoup
import urllib2
import re
response = urllib2.urlopen(sys.argv[1])
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
text_pattern = re.compile(re.escape(sys.argv[2]))
if soup.find('body').find(text=text_pattern):
    print 'Found the text in the page'
However, to narrow this down further to exclude navigation, footers, etc., you'll need to apply some heuristics. Each site is different and detecting what part of the page makes up the main text is not a straightforward task.
Instead of re-inventing that wheel, you may want to look at the Readability API; they've already built a large library of heuristics to parse out the 'main' part of a site for you.
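To give a rough idea of what such a heuristic might look like, here is a minimal sketch, not a substitute for a mature extractor. It is written for Python 3 and uses only the standard library's html.parser so it runs without extra installs; the same idea carries over to BeautifulSoup. The names SKIP_TAGS, ArticleTextExtractor, extract_main_text and keyword_in_article are made up for this example. It assumes that text inside tags such as nav, footer or script is never article text, and that a page which uses an article element keeps its main text there. Many real pages will break both assumptions.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is boilerplate, not article text.
# Tune this set per site.
SKIP_TAGS = {"title", "script", "style", "nav", "header", "footer", "aside", "form"}


class ArticleTextExtractor(HTMLParser):
    """Collect text outside SKIP_TAGS, tracking what lies inside <article>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0       # > 0 while inside a SKIP_TAGS element
        self.article_depth = 0    # > 0 while inside an <article> element
        self.article_chunks = []  # non-skipped text found inside <article>
        self.page_chunks = []     # all non-skipped text on the page

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "article" and self.article_depth > 0:
            self.article_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.page_chunks.append(data)
        if self.article_depth:
            self.article_chunks.append(data)


def extract_main_text(html):
    """Return <article> text if any was found, else all non-skipped text."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.article_chunks or parser.page_chunks
    # Collapse runs of whitespace so the result is easy to search.
    return " ".join(" ".join(chunks).split())


def keyword_in_article(html, keyword):
    """Case-sensitive substring test against the extracted text."""
    return keyword in extract_main_text(html)
```

Because header is in the skip set, a headline wrapped in a header element inside the article is excluded too, which matches the goal of ignoring the title. Pages without an article element fall back to the whole page minus the skipped tags, so the result there is only as good as the skip list.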
Upvotes: 2
Reputation: 23322
BeautifulSoup, by itself, cannot extract the text of the "article", since what an article is, HTML-wise, is entirely subjective and changes from one site to the next. You need to write a different parser for each site.
My suggestion is to model this using inheritance:
from bs4 import BeautifulSoup

class Webpage(object):
    def __init__(self, html_string):
        self.html = BeautifulSoup(html_string)

    def getArticleText(self):
        # Subclasses implement the site-specific extraction
        raise NotImplementedError

class NewYorkTimesPage(Webpage):
    def getArticleText(self):
        return self.html.find(...)
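With hundreds of URLs, something has to pick the right subclass for each one. One possible way, sketched here for Python 3 with only the standard library, is a dictionary keyed by hostname. Everything site-specific below is hypothetical: the hosts news.example.com and blog.example.org, the classes ExampleNewsPage and ExampleBlogPage, and the markers they look for are placeholders, and the plain string partitioning stands in for whatever BeautifulSoup lookup a real site would need.

```python
from urllib.parse import urlparse


class Webpage(object):
    """Base class: one subclass per site layout."""

    def __init__(self, html_string):
        self.html_string = html_string

    def getArticleText(self):
        raise NotImplementedError


class ExampleNewsPage(Webpage):
    # Hypothetical layout: the article is the first <article> element.
    def getArticleText(self):
        _, _, rest = self.html_string.partition("<article>")
        body, _, _ = rest.partition("</article>")
        return body


class ExampleBlogPage(Webpage):
    # Hypothetical layout: the post sits between two marker comments.
    def getArticleText(self):
        _, _, rest = self.html_string.partition("<!-- post -->")
        body, _, _ = rest.partition("<!-- /post -->")
        return body


# Hypothetical registry mapping hostnames to parser classes.
PARSERS = {
    "news.example.com": ExampleNewsPage,
    "blog.example.org": ExampleBlogPage,
}


def page_for(url, html_string):
    """Build the Webpage subclass registered for the URL's hostname."""
    host = urlparse(url).hostname
    if host not in PARSERS:
        raise ValueError("no parser registered for %r" % host)
    return PARSERS[host](html_string)


def article_contains(url, html_string, keyword):
    """True if keyword occurs in the site-specific article text."""
    return keyword in page_for(url, html_string).getArticleText()
```

Raising on an unknown host makes unhandled sites visible rather than silently reporting False, which matters when another script is feeding in hundreds of URLs.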
Upvotes: 2
Reputation: 4342
There is no simple way to extract the article from a web page. You can use an external content-extraction service such as Readability, which has a Python library.
Upvotes: 2