Reputation: 30958
I'm looking for a good HTML parser like HtmlAgilityPack (an open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for use with Python.
Does anyone know of one?
Upvotes: 2
Views: 1028
Reputation: 15831
Others have recommended BeautifulSoup, but lxml is a much better choice. Despite its name, it also parses and scrapes HTML. It is much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does (which is BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or some other platform where anything that isn't pure Python is not allowed.
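As a rough illustration of lxml coping with broken markup, here is a minimal sketch. The sample markup and the `extract_links` helper are made up for this example; only `lxml.html.fromstring`, `iter`, `get` and `text_content` come from lxml itself.

```python
import lxml.html

# Hypothetical sample input: the <p> and the last <a> are never closed.
BROKEN_HTML = (
    "<html><body><p>First<p>Second "
    "<a href='/a'>link A</a><a href='/b'>link B"
)


def extract_links(markup):
    """Return (href, text) pairs for every <a> element in the markup."""
    # fromstring() repairs the markup while building the element tree.
    root = lxml.html.fromstring(markup)
    return [(a.get("href"), a.text_content()) for a in root.iter("a")]


links = extract_links(BROKEN_HTML)
for href, text in links:
    print(href, text)
```

Both links should come back even though the second one is never closed. For anything more selective, `root.xpath(...)` works on the same tree.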
Upvotes: 8
Reputation: 35614
Beautiful Soup is probably what you are looking for. It is an HTML/XML parser that can deal with invalid pages and lets you, for example, iterate over specific tags.
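A minimal sketch of iterating over specific tags, assuming the `bs4` package (Beautiful Soup 4) and Python's built-in `html.parser` backend. The sample markup and the `link_targets` helper are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input: the <span> and <div> are never closed.
INVALID_HTML = (
    '<div><a href="/one">One</a><a href="/two">Two</a>'
    "<span>unclosed"
)


def link_targets(markup):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all("a") walks the whole tree and yields only <a> tags.
    return [a["href"] for a in soup.find_all("a") if a.has_attr("href")]


hrefs = link_targets(INVALID_HTML)
print(hrefs)
```

The same `find_all` call accepts any tag name, so swapping `"a"` for `"span"` or `"div"` iterates over those tags instead.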
Upvotes: 0