Reputation: 1075
I am trying to scrape a website by extracting its sub-links and their titles, and then saving the extracted titles and their associated links into a CSV file. When I run the following code, the CSV file is created but it is empty. Any help?
My Spider.py file looks like this:
from scrapy import cmdline
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HyperLinksSpider(CrawlSpider):
    name = "linksSpy"
    allowed_domains = ["some_website"]
    start_urls = ["some_website"]
    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        items = []
        for link in LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            item = ExtractlinksItem()
            for sel in response.xpath('//tr/td/a'):
                item['title'] = sel.xpath('/text()').extract()
                item['link'] = sel.xpath('/@href').extract()
                items.append(item)
        return items

cmdline.execute("scrapy crawl linksSpy".split())
My pipelines.py is:
import csv

class ExtractlinksPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('Links.csv', 'wb'))

    def process_item(self, item, spider):
        self.csvwriter.writerow((item['title'][0]), item['link'][0])
        return item
My items.py is:
import scrapy

class ExtractlinksItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    pass
I have also changed my settings.py:
ITEM_PIPELINES = {'extractLinks.pipelines.ExtractlinksPipeline': 1}
Upvotes: 1
Views: 7311
Reputation: 21446
To output all scraped data, Scrapy has a built-in feature called Feed Exports.
To put it shortly, all you need are two settings in your settings.py file:
FEED_FORMAT - the format in which the feed should be saved, in your case csv
FEED_URI - the location where the feed should be saved, e.g. ~/my_feed.csv
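For reference, a minimal sketch of what that could look like in your settings.py (Links.csv and the field list are just example values, and FEED_EXPORT_FIELDS is optional - it only fixes the column order):

FEED_FORMAT = 'csv'                        # export the scraped items as CSV
FEED_URI = 'Links.csv'                     # where the feed file will be written
FEED_EXPORT_FIELDS = ['title', 'link']     # optional: column order in the CSV

With these settings you can drop the custom ExtractlinksPipeline entirely; the same result can also be had from the command line with scrapy crawl linksSpy -o Links.csv.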
My related answer covers it in greater detail with a use case:
https://stackoverflow.com/a/41473241/3737009
Upvotes: 1