Tuhina Singh

Reputation: 1007

Python, Scrapy, Pipeline: function "process_item" not getting called

I have very simple code, shown below. Scraping works fine; I can see all the print statements generating correct data, and in the pipeline the initialization works fine. However, the process_item function is never called, as the print statement at the start of that function is never executed.

Spider: comosham.py

import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from activityadvisor.items import ComoShamLocation
from activityadvisor.items import ComoShamActivity
from activityadvisor.items import ComoShamRates
import re


class ComoSham(Spider):
    name = "comosham"
    allowed_domains = ["www.comoshambhala.com"]
    start_urls = [
        "http://www.comoshambhala.com/singapore/classes/schedules",
        "http://www.comoshambhala.com/singapore/about/location-contact",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes",
        "http://www.comoshambhala.com/singapore/rates-and-offers/rates-classes/rates-private-classes"
    ]

    def parse(self, response):  
        category = (response.url)[39:44]
        print 'in parse'
        if category == 'class':
            pass
            """self.gen_req_class(response)"""
        elif category == 'about':
            print 'about to call parse_location'
            self.parse_location(response)
        elif category == 'rates':
            pass
            """self.parse_rates(response)"""
        else:
            print 'Cant find appropriate category! check check check!! Am raising Level 5 ALARM - You are a MORON :D'


    def parse_location(self, response):
        print 'in parse_location'       
        item = ComoShamLocation()
        item['category'] = 'location'
        loc = Selector(response).xpath('((//div[@id = "node-2266"]/div/div/div)[1]/div/div/p//text())').extract()
        item['address'] = loc[2]+loc[3]+loc[4]+(loc[5])[1:11]
        item['pin'] = (loc[5])[11:18]
        item['phone'] = (loc[9])[6:20]
        item['fax'] = (loc[10])[6:20]
        item['email'] = loc[12]
        print item['address'],item['pin'],item['phone'],item['fax'],item['email']
        return item

Items file:

import scrapy
from scrapy.item import Item, Field

class ComoShamLocation(Item):
    address = Field()
    pin = Field()
    phone = Field()
    fax = Field()
    email = Field()
    category = Field()

Pipeline file:

import csv

class ComoShamPipeline(object):
    def __init__(self):
        self.locationdump = csv.writer(open('./scraped data/ComoSham/ComoshamLocation.csv', 'wb'))
        self.locationdump.writerow(['Address', 'Pin', 'Phone', 'Fax', 'Email'])


    def process_item(self,item,spider):
        print 'processing item now'
        if item['category'] == 'location':
            print item['address'],item['pin'],item['phone'],item['fax'],item['email']
            self.locationdump.writerow([item['address'],item['pin'],item['phone'],item['fax'],item['email']])
        else:
            pass

Upvotes: 13

Views: 8683

Answers (4)

Alberto

Reputation: 1489

This solved my problem: I was dropping all items before this pipeline got called, so process_item() was never invoked even though open_spider and close_spider were. The solution was simply to reorder the pipelines so this one runs before the pipeline that drops items.

Scrapy Pipeline Documentation.

Just remember that Scrapy calls Pipeline.process_item() only if there is an item to process!
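The ordering is controlled by the priority numbers in ITEM_PIPELINES: lower values run first. A minimal sketch of this fix, where DropPipeline is a hypothetical name standing in for the pipeline that drops items:

```python
# settings.py sketch -- DropPipeline is hypothetical; lower priority
# numbers run earlier, so ComoShamPipeline sees every item before
# DropPipeline gets a chance to drop it.
ITEM_PIPELINES = {
    'activityadvisor.pipelines.ComoShamPipeline': 100,  # runs first
    'activityadvisor.pipelines.DropPipeline': 200,      # drops items afterwards
}

# Order in which Scrapy will invoke the pipelines:
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)
```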

Upvotes: 1

atb00ker

Reputation: 1105

Adding to the answers above:

1. Remember to add the following line to settings.py:

ITEM_PIPELINES = {'[YOUR_PROJECT_NAME].pipelines.[YOUR_PIPELINE_CLASS]': 300}

2. Yield the item when your spider runs:

yield my_item

Upvotes: 0

Ganesh

Reputation: 867

Use ITEM_PIPELINES in settings.py:

ITEM_PIPELINES = ['project_name.pipelines.pipeline_class']
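Note that the list form above only works in older Scrapy releases; in more recent versions, ITEM_PIPELINES is a dict mapping the pipeline path to an order number in the 0-1000 range. An equivalent sketch for newer versions (the path is the same placeholder as above):

```python
# settings.py for newer Scrapy versions: a dict instead of a list,
# with an integer (0-1000) deciding the pipeline order.
ITEM_PIPELINES = {
    'project_name.pipelines.pipeline_class': 300,
}
```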

Upvotes: 5

rocktheartsm4l

Reputation: 2187

Your problem is that you never actually yield the item: parse_location returns the item to parse, but parse never yields it.

The solution would be to replace:

self.parse_location(response)

with

yield self.parse_location(response)

More specifically, process_item never gets called if no items are yielded.
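Stripped down to its essentials, the fix looks like this (a standalone sketch with the extraction logic stubbed out, not the real spider):

```python
# Sketch of the fix: parse() must *yield* the item that parse_location()
# returns; merely calling the method discards the return value, so the
# item pipeline (and process_item) never sees it.

class ComoShamFixed(object):
    def parse_location(self, response):
        # Stub standing in for the real extraction logic.
        return {'category': 'location'}

    def parse(self, response):
        # Before: self.parse_location(response)   # return value thrown away
        yield self.parse_location(response)       # item now reaches the pipeline

spider = ComoShamFixed()
items = list(spider.parse('fake-response'))
print(items)  # [{'category': 'location'}]
```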

Upvotes: 15
