Processing items with Scrapy pipeline

Question

I'm running Scrapy from a Python script.

I was told that in Scrapy, responses are built in parse() and further processed in pipelines.py.

This is how my project is set up so far:

Python script

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def script(self):
    process = CrawlerProcess(get_project_settings())
    response = process.crawl('pitchfork_albums', domain='pitchfork.com')
    process.start()  # the script will block here until the crawling is finished

Spiders

class PitchforkAlbums(scrapy.Spider):
    name = "pitchfork_albums"
    allowed_domains = ["pitchfork.com"]
    #creates objects for each URL listed here
    start_urls = [
                    "http://pitchfork.com/reviews/best/albums/?page=1",
                    "http://pitchfork.com/reviews/best/albums/?page=2",
                    "http://pitchfork.com/reviews/best/albums/?page=3"                   
    ]
    def parse(self, response):

        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()

        yield item

items.py

class PitchforkItem(scrapy.Item):

    artist = scrapy.Field()
    album = scrapy.Field()

settings.py

ITEM_PIPELINES = {
   'blogs.pipelines.PitchforkPipeline': 300,
}

pipelines.py

class PitchforkPipeline(object):

    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        for i in item:
            return i['album'][0]

If I just return item in pipelines.py, I get data like this (one item for each HTML page):

{'album': [u'Sirens',
           u'I Had a Dream That You Were Mine',
           u'Sunergy',
           u'Skeleton Tree',
           u'My Woman',
           u'JEFFERY',
           u'Blonde / Endless',
           u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
           u'HEAVN',
           u'Blank Face LP',
           u'blackSUMMERS\u2019night',
           u'Wildflower',
           u'Freetown Sound',
           u'Trans Day of Revenge',
           u'Puberty 2',
           u'Light Upon the Lake',
           u'iiiDrops',
           u'Teens of Denial',
           u'Coloring Book',
           u'A Moon Shaped Pool',
           u'The Colour in Anything',
           u'Paradise',
           u'HOPELESSNESS',
           u'Lemonade'],
 'artist': [u'Nicolas Jaar',
            u'Hamilton Leithauser',
            u'Rostam',
            u'Kaitlyn Aurelia Smith',
            u'Suzanne Ciani',
            u'Nick Cave & the Bad Seeds',
            u'Angel Olsen',
            u'Young Thug',
            u'Frank Ocean',
            u'Elza Soares',
            u'Jamila Woods',
            u'Schoolboy Q',
            u'Maxwell',
            u'The Avalanches',
            u'Blood Orange',
            u'G.L.O.S.S.',
            u'Mitski',
            u'Whitney',
            u'Joey Purp',
            u'Car Seat Headrest',
            u'Chance the Rapper',
            u'Radiohead',
            u'James Blake',
            u'White Lung',
            u'ANOHNI',
            u'Beyonc\xe9']}

What I would like to do in pipelines.py is fetch the individual album for each item, like so:

[u'Sirens']

rojeeer · Accepted Answer

I suggest building well-structured items in the spider. In Scrapy's workflow, the spider builds well-formed items (parsing the HTML, populating item instances), and the pipeline performs operations on those items (filtering them, storing them).

For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the HTML, build that kind of item instead of cramming everything into a single item.

So in the parse function of your spider, you should:

1. Put the yield item statement inside the for loop, NOT outside it. That way each album generates its own item.
2. Be careful with relative XPath selectors in Scrapy. A path that starts with // searches the whole document no matter which selector it is called on, which is why every item ended up holding every artist and album. To search the current node's descendants, use .// instead of //; to select its direct children, use ./ instead of /.
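To see the scoping rule in isolation, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it runs without a crawl. The markup is made up for illustration, and ElementTree supports only a subset of XPath; the point is simply that a path starting with . is evaluated against the current node rather than the whole page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the structure the spider targets.
PAGE = """
<root>
  <div class="album-artist">
    <ul class="artist-list"><li>Nicolas Jaar</li></ul>
    <h2 class="title">Sirens</h2>
  </div>
  <div class="album-artist">
    <ul class="artist-list"><li>Kaitlyn Aurelia Smith</li><li>Suzanne Ciani</li></ul>
    <h2 class="title">Sunergy</h2>
  </div>
</root>
"""

root = ET.fromstring(PAGE)

# Page-wide search from the root: returns every title, which is what
# an absolute // path gives you inside the loop in Scrapy.
all_titles = [h2.text for h2 in root.findall('.//h2[@class="title"]')]

# Scoped search: each path starts with "." and is evaluated
# relative to the div currently being processed.
albums = []
for div in root.findall('.//div[@class="album-artist"]'):
    artists = [li.text for li in div.findall('./ul[@class="artist-list"]/li')]
    title = div.findtext('./h2[@class="title"]')
    albums.append({"artist": artists, "album": title})
```

Each entry in albums now holds only the artists and title found inside its own div, while all_titles shows what a page-wide search collects.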

Ideally the album title should be a scalar and the album artist a list, so use extract_first() to get the album title as a single value.

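def parse(self, response):
    for sel in response.xpath('//div[@class="album-artist"]'):
        item = PitchforkItem()
        # an album can have several artists, so keep the full list
        item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
        # one title per album, so take the first match as a scalar
        item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
        # yield inside the loop: one item per album
        yield item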
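Once the spider yields one item per album, the pipeline can stay simple. The following is a minimal sketch of such a pipeline, not a definitive implementation: it assumes Python 3 and text-mode output, and the output_path attribute is a hypothetical name added here so the file location can be changed. open_spider, close_spider and process_item are the standard Scrapy pipeline hooks.

```python
import json


class PitchforkPipeline(object):
    """Writes each album item as one JSON line and passes it on."""

    # Hypothetical attribute (not part of Scrapy): where to write the output.
    output_path = 'tracks.jl'

    def open_spider(self, spider):
        # Called by Scrapy once, when the spider starts.
        self.file = open(self.output_path, 'w')

    def close_spider(self, spider):
        # Called by Scrapy once, when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # With one item per album, item['album'] is already a single title
        # and item['artist'] is the list of artists for that album.
        self.file.write(json.dumps(dict(item)) + "\n")
        # Return the item itself so any later pipeline stage still receives it.
        return item
```

Note that process_item returns the whole item rather than a string such as the album title; Scrapy expects a pipeline to return the item or raise DropItem.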

Hope this helps.
