Reputation: 435

Get the text from an article BeautifulSoup

I have a simple Python program that searches for a keyword in the page at a URL and returns True or False. I want to modify it so that it searches only inside the article: not the title, not ads, not other articles or anything else on the page. I have hundreds of URLs to check, and they don't share the same layout (I haven't checked them all, but it seems obvious). How can I do this, if it's even possible? This is my first time using BeautifulSoup.

Here is what I use right now:

import re
import sys
from BeautifulSoup import BeautifulSoup
import urllib2

#argecho.py

content = urllib2.urlopen(sys.argv[1]).read()

print sys.argv[2] in content # -> True

I pass the URL and keyword as arguments, because another script calls this one for hundreds of URLs.

Upvotes: 1

Answers (3)

Martijn Pieters

Reputation: 1122292

You can search for text in just the body text with BeautifulSoup, by converting sys.argv[2] into a regular expression:

import sys
from bs4 import BeautifulSoup
import urllib2
import re

response = urllib2.urlopen(sys.argv[1])
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
text_pattern = re.compile(re.escape(sys.argv[2]))

if soup.find('body').find(text=text_pattern):
    print 'Found the text in the page'

However, to narrow this down further to exclude navigation, footers, etc., you'll need to apply some heuristics. Each site is different and detecting what part of the page makes up the main text is not a straightforward task.
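As a rough illustration of what such heuristics might look like, here is a minimal sketch. It is written for Python 3 with bs4, and everything in it is an assumption rather than a tested recipe: the `NOISE_TAGS` set, the preference for an `<article>` element, and the helper names `article_text` and `keyword_in_article` are all made up for this example. Many sites will need different rules.

```python
from bs4 import BeautifulSoup, Comment

# Assumption: text inside these tags is usually navigation, ads, or boilerplate.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


def article_text(html):
    """Return the visible text from the most likely article region of a page."""
    soup = BeautifulSoup(html, "html.parser")
    # Prefer an explicit <article> element, then <body>, then the whole document.
    root = soup.find("article") or soup.body or soup
    parts = []
    for string in root.find_all(string=True):
        if isinstance(string, Comment):
            continue  # skip HTML comments
        if any(parent.name in NOISE_TAGS for parent in string.parents):
            continue  # skip text nested anywhere inside a noise tag
        text = string.strip()
        if text:
            parts.append(text)
    return " ".join(parts)


def keyword_in_article(html, keyword):
    """Case-insensitive substring check against the extracted text only."""
    return keyword.lower() in article_text(html).lower()
```

Treating `<header>` as noise also drops a headline wrapped in it, which matches the goal of ignoring the title, but a site that puts its body text in a `<header>` or `<form>` would be missed entirely.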

Rather than re-inventing that wheel, you may want to look at the Readability API; they've already built a large library of heuristics to parse out the 'main' part of a site for you.

Upvotes: 2

loopbackbee

Reputation: 23322

BeautifulSoup by itself cannot extract the text of the "article", since what an article is, HTML-wise, is entirely subjective and changes from one site to the next. You need to write a different parser for each site.

My suggestion is to model this using inheritance:

from bs4 import BeautifulSoup

class Webpage(object):
    def __init__(self, html_string):
        self.html = BeautifulSoup(html_string)
    def getArticleText(self):
        # Each site-specific subclass must override this method
        raise NotImplementedError

class NewYorkTimesPage(Webpage):
    def getArticleText(self):
        return self.html.find(...)
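One way this design could be wired up is to pick the subclass from the URL's hostname. The sketch below is only an illustration for Python 3 with bs4: the host `news.example.com`, the `story-body` class name, and the `ExampleNewsPage`, `GenericPage`, `PARSERS` and `page_for` names are all hypothetical and would have to be replaced with what each real site actually uses.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup


class Webpage(object):
    def __init__(self, html_string):
        self.html = BeautifulSoup(html_string, "html.parser")

    def getArticleText(self):
        raise NotImplementedError


class GenericPage(Webpage):
    """Fallback for unknown sites: return all text in <body>."""

    def getArticleText(self):
        body = self.html.body or self.html
        return body.get_text(" ", strip=True)


class ExampleNewsPage(Webpage):
    """Hypothetical site that wraps its article in <div class="story-body">."""

    def getArticleText(self):
        node = self.html.find("div", class_="story-body")
        if node is None:
            # Layout changed or page is not an article: fall back to <body>.
            return GenericPage.getArticleText(self)
        return node.get_text(" ", strip=True)


# Hypothetical registry mapping hostnames to their parser class.
PARSERS = {"news.example.com": ExampleNewsPage}


def page_for(url, html_string):
    """Build the parser registered for the URL's host, or the generic one."""
    host = urlparse(url).hostname or ""
    return PARSERS.get(host, GenericPage)(html_string)
```

With hundreds of URLs, the registry only needs one entry per distinct site, and anything unregistered still gets a whole-body search instead of failing.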

Upvotes: 2

Nazarii Bardiuk

Reputation: 4342

There is no simple way to extract an article from a web page. You can use an external content-extraction service such as Readability, together with the Python library for it.

Upvotes: 2
