Python Scrapy, parsing multiple child objects into the same item?

Question

For a non-profit college assignment I am trying to scrape the website www.rateyourmusic.com. I can scrape most things easily, but I have run into a problem when trying to scrape multiple children of an HTML element.

Specifically, I'm trying to scrape the genre of an artist. However, many artists have multiple genres and I can't scrape all of them. Here is my parsing method:

def parse_dir_contents(self, response): 

    item = rateyourmusicartist()

    #get the genres of the artist
    for sel in response.xpath('//a[@class="genre"]'):     
        item['genre'] = sel.xpath('text()').extract()

    yield item

There are usually multiple elements matching //a[@class="genre"], one per genre. What I would like to do is put them all together in one string separated by ', '.

Is there an easy way to do this? Here is a sample URL for the site I'm scraping: http://rateyourmusic.com/artist/kanye_west

alecxe · Accepted Answer

A simple str.join() would do the trick:

", ".join(response.xpath('//a[@class="genre"]/text()').extract())

Demo (from the Scrapy Shell):

$ scrapy shell http://rateyourmusic.com/artist/kanye_west
In [1]: ", ".join(response.xpath('//a[@class="genre"]/text()').extract())
Out[1]: u'Hip Hop, Pop Rap, Experimental Hip Hop, Hardcore Hip Hop, Electropop, Synthpop'
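
To see why the loop in the question ends up with a single genre, here is a minimal sketch in plain Python. The `extracted` list is a hypothetical stand-in for what `response.xpath('//a[@class="genre"]/text()').extract()` returns, and plain dicts stand in for the Scrapy item, so it runs without Scrapy installed:

```python
# Hypothetical stand-in for the list returned by
# response.xpath('//a[@class="genre"]/text()').extract()
extracted = ["Hip Hop", "Pop Rap", "Experimental Hip Hop"]

# Loop from the question: every iteration assigns to the same key,
# so only the value from the last matched element survives.
overwritten_item = {}
for text in extracted:
    overwritten_item["genre"] = [text]  # mirrors sel.xpath('text()').extract()

# Joining the whole list once keeps every genre in a single string.
joined_item = {"genre": ", ".join(extracted)}

print(overwritten_item["genre"])
print(joined_item["genre"])
```

In the spider, this amounts to replacing the for loop with a single assignment of the joined string to item['genre'].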

Note that if you use Item Loaders, you can make it much cleaner:

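# Note: newer Scrapy releases provide Join in itemloaders.processors instead
from scrapy.loader.processors import Join

# MyItemLoader is your own ItemLoader subclass
loader = MyItemLoader(response=response)
loader.add_xpath("genre", '//a[@class="genre"]/text()', Join(", "))

yield loader.load_item()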
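
For intuition about what the Join processor contributes here, the following is a rough pure-Python stand-in. `JoinSketch` is a hypothetical class written for illustration, not Scrapy's implementation; it only assumes that the real processor joins the collected values with the separator passed to its constructor:

```python
class JoinSketch:
    """Hypothetical stand-in for a Join-style processor (not Scrapy's code)."""

    def __init__(self, separator=" "):
        self.separator = separator

    def __call__(self, values):
        # Collapse the list of extracted strings into one string.
        return self.separator.join(values)


join_genres = JoinSketch(", ")
result = join_genres(["Hip Hop", "Pop Rap", "Synthpop"])
print(result)
```

The point of the sketch is that the processor receives the full list of extracted values at once, which is why no explicit loop over the matched elements is needed.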
