Scrapy best practice

Question

I'm using Scrapy to download a large amount of data, with the default 16 concurrent requests. Following a guide, I use the pipeline's process_item method to collect the data in a shared variable, and in close_spider I save it to SQL. If I crawl a very large website, this uses up all system memory. How can I avoid that problem?

Right now I use a single DB connection, prepared in the open_spider method, and I can't use it in every process_item call simultaneously.

Umair Ayub · Accepted Answer

Keep a list of scraped items in your pipeline and, once the list holds N or more items, call the DB function to save them and empty the list. Here is the code from my project, with N = 50. Note close_spider(): when the spider closes, self.items may still hold fewer than N items, so whatever remains in the list is also saved to the DB at that point.

class YourPipeline(object):
    def __init__(self):
        self.items = []


    def process_item(self, item, spider):
        # Buffer the item; the batch is written once it reaches 50 items
        self.items.append(item)
        if len(self.items) >= 50:
            self.insert_current_items(spider)
        return item


    def insert_current_items(self, spider):
        for item in self.items:
            # Build the "`column` = %s" assignments for the UPDATE statement
            update_query = ', '.join(["`" + key + "` = %s" for key in item.keys()])
            query = "SELECT asin FROM " + spider.tbl_name + " WHERE asin = %s LIMIT 1"
            spider.cursor.execute(query, (item['asin'],))
            existing = spider.cursor.fetchone()
            if spider.cursor.rowcount > 0:
                query = "UPDATE " + spider.tbl_name + " SET " + update_query + ", date_update = CURRENT_TIMESTAMP WHERE asin = %s"
                update_query_vals = list(item.values())
                # Last parameter fills the WHERE clause: the same asin that was just looked up
                update_query_vals.append(item['asin'])
                try:
                    spider.cursor.execute(query, update_query_vals)
                except Exception as e:
                    if 'MySQL server has gone away' in str(e):
                        spider.connectDB()
                        spider.cursor.execute(query, update_query_vals)
                    else:
                        raise
            else:
                # This ELSE is likely never to get executed because we are not scraping ASINS from Amazon website, we just import ASINs into DB from another script
                try:
                    placeholders = ', '.join(['%s'] * len(item))
                    columns = ', '.join(item.keys())
                    query = "INSERT INTO %s ( %s ) VALUES ( %s )" % (spider.tbl_name, columns, placeholders)
                    # Pass the values as a list so they line up with the %s placeholders
                    insert_vals = list(item.values())
                    spider.cursor.execute(query, insert_vals)
                except Exception as e:
                    if 'MySQL server has gone away' in str(e):
                        spider.connectDB()
                        spider.cursor.execute(query, insert_vals)
                    else:
                        raise
        self.items = []


    def close_spider(self, spider):
        self.insert_current_items(spider)
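
If you want to try the batching idea without a MySQL server, here is a minimal sketch of the same pattern using the standard-library sqlite3 module. It is not the code from my project: the class name BufferedSQLitePipeline, the products table, the title column, and the batch_size parameter are all made up for illustration, and it assumes every item has "asin" and "title" keys. It also moves the connection into the pipeline's own open_spider, and writes each batch with a single executemany call instead of one query per item.

```python
import sqlite3


class BufferedSQLitePipeline:
    """Hypothetical pipeline: buffers items and writes them to SQLite in batches."""

    def __init__(self, db_path=":memory:", batch_size=50):
        self.db_path = db_path
        self.batch_size = batch_size
        self.items = []
        self.conn = None

    def open_spider(self, spider):
        # One connection for the whole crawl, opened when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (asin TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Only the current batch is kept in memory, never the whole crawl.
        self.items.append(item)
        if len(self.items) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        if not self.items:
            return
        rows = [(item["asin"], item["title"]) for item in self.items]
        # INSERT OR REPLACE overwrites a row whose asin already exists;
        # executemany sends the whole batch through one call.
        self.conn.executemany(
            "INSERT OR REPLACE INTO products (asin, title) VALUES (?, ?)", rows
        )
        self.conn.commit()
        self.items = []

    def close_spider(self, spider):
        # Write whatever is left (fewer than batch_size items), then clean up.
        self.flush()
        self.conn.close()
```

Memory use is then bounded by batch_size rather than by the size of the site, so a larger batch means fewer round trips to the DB at the cost of a bigger buffer.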
