Reputation: 1950
I am trying to use Scrapy on a site which I do not know the URL structure of.
I would like to:
only extract data from pages which contain the XPath //div[@class="product-view"]
extract and print (in CSV) the URL, plus the name and the price found by their XPaths
When I run the script below, all I get is a random list of URLs:
scrapy crawl dmoz > test.txt
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'site.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = "dmoz"
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            if response.xpath('//div[@class="product-view"]'):
                url = response.extract()
                name = response.xpath('//div[@class="product-name"]/h1/text()').extract()
                price = response.xpath('//span[@class="product_price_details"]/text()').extract()
            yield Request(url, callback=self.parse)
            print url
Upvotes: 2
Views: 1543
Reputation: 21436
What you are looking for here is scrapy.spiders.CrawlSpider.
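A CrawlSpider version would look roughly like the sketch below (the spider name, domain and XPaths are taken from your snippet; the ProductSpider class name and the "follow everything" rule are just assumptions):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ['site.com']
    start_urls = ['http://site.com']

    # follow every link found and run parse_item on each downloaded page
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # only product pages contain this div
        if response.xpath('//div[@class="product-view"]'):
            yield {
                'url': response.url,
                'name': response.xpath('//div[@class="product-name"]/h1/text()').extract_first(),
                'price': response.xpath('//span[@class="product_price_details"]/text()').extract_first(),
            }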
However, you almost got it with your own approach. Here's the fixed version.
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

def parse(self, response):
    # parse this page
    if response.xpath('//div[@class="product-view"]'):
        item = dict()
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="product-name"]/h1/text()').extract_first()
        item['price'] = response.xpath('//span[@class="product_price_details"]/text()').extract_first()
        yield item  # return an item with your data

    # other pages
    le = LinkExtractor()  # LinkExtractor is smarter than xpath '//a/@href'
    for link in le.extract_links(response):
        yield Request(link.url)  # default callback is already self.parse
Now you can simply run scrapy crawl dmoz -o results.csv
and Scrapy will output a CSV of your items. Keep an eye on the log though, especially the stats bit at the end; that's how you know if something went wrong.
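If you want the CSV fields spelled out in one place you can also declare them explicitly instead of yielding a plain dict; a minimal sketch, where the ProductItem name is just a placeholder for whatever you call it:

import scrapy

class ProductItem(scrapy.Item):
    # declared fields make the exported columns explicit
    url = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()

Then yield ProductItem(url=response.url, name=..., price=...) from parse instead of the dict.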
Upvotes: 2