Reputation: 10043
I'm running Scrapy from a Python script.
I was told that in Scrapy, responses are built in parse() and further processed in pipeline.py.
This is how my project is set up so far:
Python script
def script(self):
    process = CrawlerProcess(get_project_settings())
    response = process.crawl('pitchfork_albums', domain='pitchfork.com')
    process.start()  # the script will block here until the crawling is finished
Spiders
class PitchforkAlbums(scrapy.Spider):
    name = "pitchfork_albums"
    allowed_domains = ["pitchfork.com"]
    # creates objects for each URL listed here
    start_urls = [
        "http://pitchfork.com/reviews/best/albums/?page=1",
        "http://pitchfork.com/reviews/best/albums/?page=2",
        "http://pitchfork.com/reviews/best/albums/?page=3"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()
        yield item
items.py
class PitchforkItem(scrapy.Item):
    artist = scrapy.Field()
    album = scrapy.Field()
settings.py
ITEM_PIPELINES = {
    'blogs.pipelines.PitchforkPipeline': 300,
}
pipelines.py
class PitchforkPipeline(object):
    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        for i in item:
            return i['album'][0]
If I just return item in pipelines.py, I get data like so (one response for each html page):
{'album': [u'Sirens',
u'I Had a Dream That You Were Mine',
u'Sunergy',
u'Skeleton Tree',
u'My Woman',
u'JEFFERY',
u'Blonde / Endless',
u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
u'HEAVN',
u'Blank Face LP',
u'blackSUMMERS\u2019night',
u'Wildflower',
u'Freetown Sound',
u'Trans Day of Revenge',
u'Puberty 2',
u'Light Upon the Lake',
u'iiiDrops',
u'Teens of Denial',
u'Coloring Book',
u'A Moon Shaped Pool',
u'The Colour in Anything',
u'Paradise',
u'HOPELESSNESS',
u'Lemonade'],
'artist': [u'Nicolas Jaar',
u'Hamilton Leithauser',
u'Rostam',
u'Kaitlyn Aurelia Smith',
u'Suzanne Ciani',
u'Nick Cave & the Bad Seeds',
u'Angel Olsen',
u'Young Thug',
u'Frank Ocean',
u'Elza Soares',
u'Jamila Woods',
u'Schoolboy Q',
u'Maxwell',
u'The Avalanches',
u'Blood Orange',
u'G.L.O.S.S.',
u'Mitski',
u'Whitney',
u'Joey Purp',
u'Car Seat Headrest',
u'Chance the Rapper',
u'Radiohead',
u'James Blake',
u'White Lung',
u'ANOHNI',
u'Beyonc\xe9']}
What I would like to do in pipelines.py is to be able to fetch individual songs for each item, like so:
[u'Sirens']
Upvotes: 0
Views: 3750
Reputation: 2011
I suggest that you build well-structured items in the spider. In the Scrapy workflow, the spider is responsible for building well-formed items (parsing the HTML and populating item instances), while the pipeline performs operations on those items (for example, filtering or storing them).
For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the HTML, build items of that kind instead of cramming everything into a single item.
So in the parse function of your spider.py, you should:
1. Put the yield item statement inside the for loop, NOT OUTSIDE. This way, each album generates its own item.
2. Use relative XPath expressions. To search the descendants of the current node, use .// instead of //, and to select children of the current node, use ./ instead of /.
3. Ideally the album title should be a scalar and the album artist should be a list, so use extract_first to get the album title as a scalar.
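To illustrate the difference between a document-wide search and a search scoped to the current node, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup is made up to mimic the structure the question's XPath expressions assume, so treat it as an illustration rather than the real page. In Scrapy itself, an expression that starts with // searches the whole document even when it is called on a nested selector, which is why every item in the question ended up with every album on the page.

```python
import xml.etree.ElementTree as ET

# Made-up markup; the real page structure may differ.
PAGE = """
<html><body>
  <div class="album-artist">
    <ul class="artist-list"><li>Nicolas Jaar</li></ul>
    <h2 class="title">Sirens</h2>
  </div>
  <div class="album-artist">
    <ul class="artist-list"><li>Kaitlyn Aurelia Smith</li><li>Suzanne Ciani</li></ul>
    <h2 class="title">Sunergy</h2>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Document-wide search: every title on the page, whichever block it sits in.
all_titles = [h2.text for h2 in root.findall('.//h2[@class="title"]')]

# Scoped search: start from each block and look only at its own children.
albums = []
for block in root.findall('.//div[@class="album-artist"]'):
    title = block.find('./h2[@class="title"]').text
    artists = [li.text for li in block.findall('./ul[@class="artist-list"]/li')]
    albums.append({'album': title, 'artist': artists})

print(all_titles)
print(albums)
```

The scoped loop produces one record per album, with the title as a single string and the artists as a list.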
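def parse(self, response):
    # One item per album block
    for sel in response.xpath('//div[@class="album-artist"]'):
        item = PitchforkItem()
        # An album can have several artists, so keep the whole list
        item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
        # The title is a single value, so take the first match only
        item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
        yield item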
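Once the spider yields one item per album, the pipeline can read the fields of each item directly. The following is only a sketch of what that might look like: the class name AlbumPipeline, the path attribute and the albums.jl file name are hypothetical, while open_spider, close_spider and process_item are the standard pipeline hooks that Scrapy calls.

```python
import json


class AlbumPipeline(object):
    """Hypothetical pipeline that writes one JSON line per album item."""

    # Hypothetical output file name; adjust to your project.
    path = 'albums.jl'

    def open_spider(self, spider):
        # Text mode, because json.dumps returns str on Python 3.
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        record = dict(item)
        # With one item per album, each field is reachable directly.
        title = record['album']
        artists = record['artist']
        self.file.write(json.dumps({'album': title, 'artist': artists}) + "\n")
        # Return the item so that any later pipeline still receives it.
        return item
```

Returning the item itself, rather than one of its fields, keeps it flowing through any remaining pipelines and into the feed exports.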
Hope this helps.
Upvotes: 3