Reputation: 205
For a non-profit college assignment, I am trying to scrape the website www.rateyourmusic.com. I can scrape most things easily, but I have run into a problem when trying to scrape multiple children of an HTML element.
Specifically, I'm trying to scrape the genre of an artist. However, many artists have multiple genres and I can't scrape all of them. Here is my parsing method:
def parse_dir_contents(self, response):
    item = rateyourmusicartist()
    # get the genres of the artist
    for sel in response.xpath('//a[@class="genre"]'):
        item['genre'] = sel.xpath('text()').extract()
    yield item
There are usually multiple //a[@class="genre"] elements representing the genre. What I would like to do is put them all together in one string separated by ', '.
Is there an easy way to do this? Here is a sample URL for the site I'm scraping: http://rateyourmusic.com/artist/kanye_west
Upvotes: 1
Views: 885
Reputation: 474171
A simple str.join() would do the trick:
", ".join(response.xpath('//a[@class="genre"]/text()').extract())
Demo (from the Scrapy Shell):
$ scrapy shell http://rateyourmusic.com/artist/kanye_west
In [1]: ", ".join(response.xpath('//a[@class="genre"]/text()').extract())
Out[1]: u'Hip Hop, Pop Rap, Experimental Hip Hop, Hardcore Hip Hop, Electropop, Synthpop'
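If the extracted strings ever carry stray whitespace or empty entries, you could tidy them up before joining. This is only a minimal sketch: join_genres is a helper name made up for illustration (it is not part of Scrapy), and the sample list stands in for whatever .extract() returns on a real page.

```python
def join_genres(texts, sep=", "):
    """Join extracted text nodes into one string.

    Hypothetical helper: strips surrounding whitespace and skips
    entries that are empty after stripping.
    """
    cleaned = [t.strip() for t in texts if t and t.strip()]
    return sep.join(cleaned)


# Stand-in for response.xpath('//a[@class="genre"]/text()').extract()
extracted = ["Hip Hop", " Pop Rap ", "", "Electropop"]
genres = join_genres(extracted)
print(genres)
```

Inside the spider callback from the question, the result would then be assigned once, e.g. item['genre'] = join_genres(response.xpath('//a[@class="genre"]/text()').extract()), instead of overwriting item['genre'] on every loop iteration.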
Note that, if you were to use Item Loaders, you can make it much cleaner:
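# Newer Scrapy versions expose the same processor as itemloaders.processors.Join
from scrapy.loader.processors import Join

# MyItemLoader is assumed to be your own ItemLoader subclass for this item
loader = MyItemLoader(response=response)
loader.add_xpath("genre", '//a[@class="genre"]/text()', Join(", "))
yield loader.load_item()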
Upvotes: 1