Python Text Extraction from parsed web pages

Question

I'm working to develop a small system for extracting content from web pages (I know it has been done, but it is a good exercise and something I need). Basically, I'm looking to extract content-content, i.e. if it is an article, I just want the article text and nothing else.

I've just started, so consider me a dumb blank slate. I'm interested in how you do it, and with what, specifically in Python but I'd be interested in any

EDIT:

I've found this rather enlightening and more in tune with what I'm trying to do, so solutions, discussion, and library suggestions along 'this type of thing' appreciated.

jsj · Accepted Answer

I have done a little bit of this and I recommend the combination of Mechanize and BeautifulSoup.

I would recommend parsing the HTML tree with BeautifulSoup and looking for a distinctive tag that identifies the content, perhaps:

Then you can just take that node from the "soup".
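
As a rough sketch of that idea (not necessarily the markup the answer had in mind): the snippet below assumes the article body sits in a container with id "article". That id, the sample page, and the helper name extract_article_text are all made up for illustration; on a real site you would inspect the page source to find whatever tag actually wraps the content.

```python
from bs4 import BeautifulSoup

# Hypothetical page used for illustration; the id "article" is an assumption,
# real sites will use their own ids or class names.
SAMPLE_HTML = """
<html>
  <body>
    <div id="nav"><a href="/">Home</a></div>
    <div id="article">
      <h1>Title</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
    <div id="footer">Copyright</div>
  </body>
</html>
"""


def extract_article_text(html, container_id="article"):
    """Return the text inside the element with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Take the single node that wraps the content out of the "soup"
    node = soup.find(id=container_id)
    if node is None:
        return None
    # Collect the text of all descendants, one stripped fragment per line
    return node.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    print(extract_article_text(SAMPLE_HTML))
```

The html string here is hard-coded; in practice it would be whatever you fetched (with Mechanize, as suggested above, or any other HTTP client). Navigation and footer text are dropped simply because they live outside the chosen node.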
