Reputation: 782
I've built a crawler for a particular website using Scrapy. The crawler follows a URL if it matches one given regex, and calls the callback function if the URL matches another defined regex. The main purpose of the crawler is to extract all the required links within the website, rather than the contents of each link. Can anyone tell me how to print the list of all the crawled links? The code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class XyzSpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.com"]
    start_urls = ["http://www.xyz.com/Vacanciess"]
    rules = (
        Rule(SgmlLinkExtractor(allow=[regex2]), callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=[regex1]), follow=True),
    )

    def parse_item(self, response):
        # sel = Selector(response)
        # title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        # print title
        print response
The print title code works perfectly well. But when, as in the code above, I try to print the actual response, it returns:
[xyz] DEBUG: Crawled (200)<GET http://www.xyz.com/urlmatchingregex2> (referer: http://www.xyz.com/urlmatchingregex1)
<200 http://www.xyz.com/urlmatchingregex2>
Can anyone please help me retrieve the actual URL?
Upvotes: 0
Views: 395
Reputation: 12092
You can print response.url in the parse_item method to print the crawled URL. It is documented here.
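For example, a minimal sketch of how parse_item could both print each URL and collect the full list of crawled links (the crawled_urls attribute and the closed() method are illustrative additions, assuming your Scrapy version supports the spider_closed shortcut):

crawled_urls = []  # collects every link the spider visits (illustrative)

def parse_item(self, response):
    # response.url is the absolute URL of the page this callback received
    print response.url
    self.crawled_urls.append(response.url)

def closed(self, reason):
    # called once the spider finishes; prints all crawled links together
    print self.crawled_urls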
Upvotes: 1