Reputation: 432
I am trying to crawl some attributes from all (#123) detail pages linked from this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not following the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck!
Below is the code:
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]

    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)

    def parse_items(self, response):
        print "response", response
        hxs = HtmlXPathSelector(response)
        title = hxs.select("//*[@id='content']/div/h4").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
Can someone please advise what I am doing wrong here?
Upvotes: 2
Views: 2920
Reputation: 473763
The key problem with your code is that you have not set the rules for the CrawlSpider.

Other improvements I would suggest:

- there is no need to instantiate HtmlXPathSelector; you can use response directly
- select() is deprecated now, use xpath()
- get the text() of the title element in order to retrieve, for instance, Chandoka instead of <h4>Chandoka</h4>
The complete code with the applied improvements:
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese",
    ]

    rules = [
        Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
             callback='parse_items', follow=True)
    ]

    def parse_items(self, response):
        title = response.xpath("//*[@id='content']/div/h4/text()").extract()
        title = "".join(title)
        title = title.strip().replace("\n", "").lstrip()
        print "title is:", title
Upvotes: 3
Reputation: 1712
Seems like you have some syntax errors. Try this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/",
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
    )

    def parse_items(self, response):
        print "response", response
Upvotes: 0
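For completeness, either spider can also be launched from a plain Python script instead of the scrapy crawl command. This is a minimal sketch using CrawlerProcess (available since Scrapy 1.0), assuming the Stinkybklyn class above is defined in the same file; the USER_AGENT value is just an illustrative setting:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess({
        # any reasonable user agent; this value is only an example
        "USER_AGENT": "Mozilla/5.0 (compatible; example-bot)",
    })
    process.crawl(Stinkybklyn)  # the CrawlSpider subclass defined above
    process.start()             # blocks until crawling is finished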