Smashed

Reputation: 341

Passing scraped data to a pipeline's __init__ in Scrapy for Python

I am trying to pass the items that contain the title data to my pipelines. Is there a way to do this inside the parse method, since the data gets reset for the next page? I tried super(mySpider, self).__init__(*args, **kwargs) but the data is not passed correctly. I need the title of the webpage as the filename, which is why I need that specific item in there.

Something like this.

    def __init__(self, item):
        self.csvwriter = csv.writer(open(item['title'][0] + '.csv', 'wb'), delimiter=',')
        self.csvwriter.writerow(['Name', 'Date', 'Location', 'Stars',
                                 'Subject', 'Comment', 'Response', 'Title'])

Upvotes: 0

Views: 78

Answers (2)

William Kinaan

Reputation: 28809

The input to any pipeline is your Item. In your case, you would need to pass the name (or any other data) in your Item. Then you should write a pipeline that writes that item to the file system (or a database, or whatever else you want).

Sample code

Let's say your new pipeline is named 'NewPipeline' and lives in the root of your Scrapy project.

In your settings, you would need to define that pipeline like this:

ITEM_PIPELINES = {
    'YourRootDirectory.NewPipeline.NewPipeline': 800,
    # add any other pipelines you have
}

And your pipeline should be like this:

import json

class NewPipeline(object):
    def process_item(self, item, spider):
        name = item['name']
        # serialize the item as one JSON line and write it to its own file
        with open("pathToWhereYouWantToSave" + name, 'wb') as f:
            f.write(json.dumps(dict(item)))
        return item
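
For completeness, here is a minimal sketch of the spider side that populates the 'name' field the pipeline reads. The item class, spider, and XPath below are illustrative assumptions, not taken from the question:

import scrapy

class ReviewItem(scrapy.Item):
    # hypothetical item with the 'name' field the pipeline above expects
    name = scrapy.Field()

class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    start_urls = ['http://example.com']

    def parse(self, response):
        item = ReviewItem()
        # use the page title as the value the pipeline turns into a filename
        item['name'] = response.xpath('//title/text()').extract_first()
        yield item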

Note

You can put your pipeline in any other module.

Upvotes: 2

GHajba

Reputation: 3691

An ItemPipeline works quite differently from what you imagine.

If you look at the docs you can see:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

This means that the header you pass along with an item arrives at the pipeline only with that one item. And since the order of items is not guaranteed by default, you cannot expect that particular item to reach the pipeline first and set the header.

One alternative is to mark this specific item and look for it in your pipeline: until it arrives, buffer the incoming items; once it does, write the title, flush the buffered items, and from then on write each item to your CSV file as it comes. Another alternative is to write the items only when the spider has finished crawling, as in the sketch below.
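
Here is a rough sketch of that last alternative: buffer every item and write the CSV only in close_spider. The field names are guessed from the question's header row, and the 'wb' mode assumes Python 2, as in the question:

import csv

class BufferedCsvPipeline(object):
    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # only collect; nothing is written until the crawl finishes
        self.items.append(item)
        return item

    def close_spider(self, spider):
        if not self.items:
            return
        # derive the filename from the first item's title, as the question wants
        filename = self.items[0]['title'][0] + '.csv'
        with open(filename, 'wb') as f:
            writer = csv.writer(f, delimiter=',')
            writer.writerow(['Name', 'Date', 'Location', 'Stars',
                             'Subject', 'Comment', 'Response', 'Title'])
            for item in self.items:
                writer.writerow([
                    item.get('name'), item.get('date'), item.get('location'),
                    item.get('stars'), item.get('subject'), item.get('comment'),
                    item.get('response'), item.get('title'),
                ])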

However, I wonder why the exported headers aren't simply fixed for the spider you use... but nevertheless this can happen.

Upvotes: 1
