Reputation: 267190
Currently I have a spider written in Java (using HtmlUnit) that logs into a supplier website and crawls it.
It keeps the session (cookie) and even lets me enable or disable JavaScript.
I also use htmlparser (Java) to help parse the HTML and extract the relevant information.
Does Python have something similar?
Upvotes: 1
Views: 2022
Reputation: 1358
Scrapy is a crawling framework that handles the HTTP requests for you (its networking is built on Twisted rather than urllib2) and wires up several parsers and helper routines.
Upvotes: 1
Reputation: 49226
Python has urllib2 for fetching pages; it supports password authentication and cookies.
There is also HTMLParser in the standard library for parsing HTML, but some people prefer the more full-featured BeautifulSoup.
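
As a rough sketch of how those two pieces fit together: in Python 3, urllib2 became urllib.request, cookielib became http.cookiejar, and HTMLParser moved to html.parser, so the example below uses those names. The helper names (make_session, LinkExtractor, login_and_fetch) and the "username"/"password" form fields are made up for illustration; a real site will have its own login form. Note that none of this executes JavaScript.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode


def make_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def login_and_fetch(opener, login_url, page_url, username, password):
    """POST a login form, then fetch a page using the same cookie session.

    The form field names here are assumptions, not taken from any real site.
    """
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    opener.open(login_url, form).close()  # session cookie lands in the jar
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Typical use would be to call make_session() once, pass the opener to login_and_fetch(), and then feed the returned HTML to a LinkExtractor to find the next pages to visit.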
Upvotes: 4