Reputation: 4894
I'm working to develop a small system for extracting content from web pages (I know it has been done, but it is a good exercise and something I need). Basically, I'm looking to extract content-content, i.e. if it is an article, I just want the article text and nothing else.
I've just started, so consider me a dumb blank slate. I'm interested in how you do it, and with what, specifically in Python, but I'd be interested in any
EDIT:
I've found this rather enlightening and more in tune with what I'm trying to do, so solutions, discussion, and library suggestions along 'this type of thing' appreciated.
Upvotes: 2
Views: 1521
Reputation: 9381
I have done a little bit of this and I recommend the combination of Mechanize and BeautifulSoup.
Parse the HTML tree with BeautifulSoup and look for a distinctive tag that identifies the content, perhaps:
<div id="article">
Then you can just take that node from the "soup".
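To make that concrete, here is a minimal sketch of the BeautifulSoup half, assuming the `bs4` package (BeautifulSoup 4) and a page whose article really does live in `<div id="article">`. The sample HTML and the `extract_article` helper are made up for illustration; a real page would be fetched first (with Mechanize or any other HTTP client) and will probably use a different tag or id.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration; a real page would be
# downloaded first and its markup will differ.
SAMPLE_HTML = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="article"><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div id="footer">Copyright</div>
</body></html>
"""


def extract_article(html, container_id="article"):
    """Return the text inside <div id=container_id>, or None if it is absent."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", id=container_id)
    if node is None:
        return None
    # Join the text fragments with newlines and drop surrounding whitespace.
    return node.get_text(separator="\n", strip=True)


text = extract_article(SAMPLE_HTML)
print(text)
```

This only works when the site marks its content with a predictable tag, so the id usually has to be adjusted per site. The function returns `None` when the container is missing so the caller can fall back to something else.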
Upvotes: 1