Reputation: 2067
I am working on a project that fetches the web pages returned by a Google search and then strips the HTML tags to obtain plain text content.
Any suggestions for available tools (especially Python tools)?
Many thanks.
Upvotes: 1
Views: 434
Reputation: 8225
I'd check out Pattern, a Python web mining module that provides a suite of text retrieval, analysis, and visualization tools. I haven't used it personally, but it looks powerful.
Module pattern.web is a web toolkit that bundles various APIs (Google, Gmail, Bing, Twitter, Wikipedia, Flickr) with a robust HTML parser and web spider. Its purpose is to retrieve online content in an easy-to-use, uniform way.
Upvotes: 2
Reputation: 5226
Python has a built-in HTML parser that's actually pretty quick, found here. There's also a really powerful library called Beautiful Soup that offers additional functionality, especially for HTML scraping.
However, I also have to ask: why not use the search API?
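For the tag-stripping part, here is a minimal sketch using only the standard library's `html.parser` module. The names `TextExtractor` and `strip_tags` are made up for this example, and skipping `<script>` and `<style>` contents is my assumption about what "pure text" should mean. It does not fetch anything; it only cleans HTML you already have as a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring script/style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b></p><p>Second &amp; last</p>"
    print(strip_tags(sample))
```

Joining the pieces with a space keeps adjacent block elements from running together, at the cost of an occasional extra space before punctuation that directly follows an inline tag. For messy real-world pages, Beautiful Soup is likely the more forgiving choice.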
Upvotes: 0