Reputation: 2137
I am using Scrapy for one of my projects. The data gets scraped by the spider and passed to a pipeline for insertion into the database. Here is my database class code:
import MySQLdb

class Database:
    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self):
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception as ex:
            self.connection.rollback()

    def __del__(self):
        self.connection.close()
Here is my pipeline code that processes scraped items and saves them into the MySQL database.
from con import Database

class LinkPipeline(object):
    def __init__(self):
        self.db = Database()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status,
                   company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'],
                  item['status'], item['company_id'], item['scraped_link'], item['content'],
                  item['detail_link'], item['job_id'])
        self.db.insert(query, params)
        return item
From the above flow, my feeling is that whenever an item is processed by the pipeline, a database connection is opened and then closed once process_item completes. That would open too many database connections. I want the database connection to be opened only once during the whole life cycle of the spider and closed when the spider is closed.
I read that there are open_spider and close_spider methods; if I use them, how can I pass a reference to the database connection from the spider's start_requests method to the pipeline class?
Are there any better approaches to go about it? For reference, the sketch after this paragraph is roughly what I had in mind for the open_spider/close_spider route.
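Here is that rough, untested sketch: the pipeline opens a single connection in open_spider and closes it in close_spider, with the connection settings simply mirroring my Database class above.
import MySQLdb

class LinkPipeline(object):
    def open_spider(self, spider):
        # open a single connection when the spider starts
        self.connection = MySQLdb.connect('localhost', 'root', 'test123', 'scraping_db',
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        # close it once when the spider finishes
        self.connection.close()

    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status,
                   company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'],
                  item['status'], item['company_id'], item['scraped_link'], item['content'],
                  item['detail_link'], item['job_id'])
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception:
            self.connection.rollback()
        return item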
Upvotes: 2
Views: 3093
Reputation: 21261
import MySQLdb
import scrapy

class MySpider(scrapy.Spider):
    name = "myspidername"
    host = 'localhost'
    user = 'root'
    password = 'test123'
    db = 'scraping_db'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.connection = MySQLdb.connect(self.host, self.user, self.password, self.db,
                                          use_unicode=True, charset="utf8")
        self.cursor = self.connection.cursor()

    def insert(self, query, params):
        try:
            self.cursor.execute(query, params)
            self.connection.commit()
        except Exception as ex:
            self.connection.rollback()

    def __del__(self):
        self.connection.close()
Then, in your pipeline, go through the spider argument (spider.insert, or spider.cursor directly) to perform any MySQL operation:
class LinkPipeline(object):
    def process_item(self, item, spider):
        query = """INSERT INTO links (title, location, company_name, posted_date, status,
                   company_id, scraped_link, content, detail_link, job_id)
                   VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
        params = (item['title'], item['location'], item['company_name'], item['posted_date'],
                  item['status'], item['company_id'], item['scraped_link'], item['content'],
                  item['detail_link'], item['job_id'])
        spider.insert(query, params)
        return item
Upvotes: 4