Reputation: 489
I'm very new to Scrapy. Here is my spider to crawl twistedweb:
    class TwistedWebSpider(BaseSpider):
        name = "twistedweb3"
        allowed_domains = ["twistedmatrix.com"]
        start_urls = [
            "http://twistedmatrix.com/documents/current/web/howto/",
        ]
        rules = (
            Rule(SgmlLinkExtractor(),
                 'parse',
                 follow=True,
            ),
        )

        def parse(self, response):
            print response.url
            filename = response.url.split("/")[-1]
            filename = filename or "index.html"
            open(filename, 'wb').write(response.body)
When I run scrapy-ctl.py crawl twistedweb3, it fetches only the start page.
After getting the index.html content, I tried running SgmlLinkExtractor on it. It extracts the links I expected, but these links are not followed.
Can you show me where I am going wrong?
Suppose I also want to get the CSS and JavaScript files. How do I achieve this? I mean, how do I download the full website?
Upvotes: 0
Views: 850
Reputation: 6710
The rules attribute belongs to CrawlSpider. Use class MySpider(CrawlSpider).
Also, when you use CrawlSpider you must not override the parse method, because CrawlSpider uses parse itself to apply the rules. Name your callback parse_response or something similar instead.
Upvotes: 4