Reputation: 3352
I am trying to get data from Amazon using Scrapy. I can get the data into a CSV, but I am not able to insert it into a MySQL database. Please find my code below. My spider is:
import scrapy
from craigslist_sample.items import AmazonDepartmentItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class AmazonAllDepartmentSpider(scrapy.Spider):
    name = "amazon"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "http://www.amazon.com/gp/site-directory/ref=nav_sad/187-3757581-3331414"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = AmazonDepartmentItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
        return item
My pipeline code is:
import sys
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request

class MySQLStorePipeline(object):
    host = 'rerhr.com'
    user = 'amazon'
    password = 'sads23'
    db = 'amazon_project'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO amazon_project.ProductDepartment (ProductDepartmentLilnk)
                                   VALUES (%s)""",
                                (item['link'].encode('utf-8')))
            self.connection.commit()
        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
        return item
When I run the following command:

scrapy crawl amazon -o items.csv -t csv

I am able to get the data into my CSV. But when I run:

scrapy crawl amazon

with the above code, I am not able to insert the data into MySQL.

Please help me understand what I have to do so that I can insert data into MySQL.

Thanks
Upvotes: 1
Views: 6059
Reputation: 474271
The problem is actually inside the parse() callback. extract() calls return a list and, as a result, all item field values become lists. The item['link'].encode('utf-8') call in the pipeline then fails because there is no encode() method on a list.
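You can see the failure in isolation (an illustrative snippet, not part of the project code):

# extract() returns a list of unicode strings, even for a single match
link = [u'/gp/site-directory/ref=nav_sad']  # what item['link'] actually holds

link.encode('utf-8')
# AttributeError: 'list' object has no attribute 'encode'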
A quick and simple fix would be to get the first elements of the extract() call results:
def parse(self, response):
    for sel in response.xpath('//ul/li'):
        item = AmazonDepartmentItem()
        item['title'] = sel.xpath('a/text()').extract()[0]
        item['link'] = sel.xpath('a/@href').extract()[0]
        item['desc'] = sel.xpath('text()').extract()[0]
        yield item
Note that I've also replaced the return item expression with yield item and put it inside the loop.
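One caveat worth knowing: indexing with [0] raises an IndexError for any li element without a matching a child. A defensive variant might look like this (a sketch; the skip-on-missing-link behavior is my assumption, not part of the original answer):

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        title = sel.xpath('a/text()').extract()
        link = sel.xpath('a/@href').extract()
        desc = sel.xpath('text()').extract()
        # skip <li> elements that do not contain a department link
        if not title or not link:
            continue
        item = AmazonDepartmentItem()
        item['title'] = title[0]
        item['link'] = link[0]
        item['desc'] = desc[0] if desc else u''
        yield item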
A better approach would be to define an ItemLoader with input and output processors:
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
FYI, here is what TakeFirst() does:
Returns the first non-null/non-empty value from the values received, so it’s typically used as an output processor to single-valued fields. It doesn’t receive any constructor arguments, nor accept Loader contexts.
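In other words, as a quick illustration (not from the original answer):

from scrapy.contrib.loader.processor import TakeFirst

take_first = TakeFirst()
# skips None and empty strings, returns the first real value
print take_first([u'', None, u'/gp/site-directory', u'/other'])
# u'/gp/site-directory'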
Then, the parse() method would transform into:
def parse(self, response):
    for sel in response.xpath('//ul/li'):
        # use the ProductLoader subclass defined above so TakeFirst applies
        l = ProductLoader(item=AmazonDepartmentItem(), selector=sel)
        l.add_xpath('title', 'a/text()')
        l.add_xpath('link', 'a/@href')
        l.add_xpath('desc', 'text()')
        yield l.load_item()
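One more thing to double-check: Scrapy only runs the pipeline if it is enabled in settings.py. A sketch, assuming the pipeline lives in a pipelines module of the craigslist_sample package seen in the question's imports:

# settings.py -- module path assumed from the question's
# "craigslist_sample" imports; adjust to your project layout
ITEM_PIPELINES = {
    'craigslist_sample.pipelines.MySQLStorePipeline': 300,
}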
Upvotes: 7