DataScienceAmateur

Reputation: 188

Scraping Multiple Pages Scrapy

I'm trying to scrape every year of the Billboard Hot 100. I have a spider that works for one year at a time, but I want it to crawl through all the years and gather that data as well. Here is my current code:

from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number=1958

    def parse(self, response):
        print self.page_number
        print "----------"

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist

            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

but I'm getting this error: "start_urls = [URL % 1958] ValueError: unsupported format character '/' (0x2f) at index 41"

Any ideas? I want the code to automatically change the year in the original "URL" link to 1959, keep going year by year until it stops finding the table, and then close out.

Upvotes: 0

Views: 2927

Answers (1)

sxn

Reputation: 508

The error you're getting is because you're not using the correct syntax for string formatting. You can have a look here for details on how it works. The reason it doesn't work in your particular case is that your URL is missing an 's':

URL = "http://www.billboard.com/archive/charts/%/hot-100"

should be

URL = "http://www.billboard.com/archive/charts/%s/hot-100"

Anyway, it's better to use new-style string formatting:

URL = "http://www.billboard.com/archive/charts/{}/hot-100"
start_urls = [URL.format(1958)]
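
For example, format() fills the {} placeholder with whatever value you pass in:

>>> "http://www.billboard.com/archive/charts/{}/hot-100".format(1958)
'http://www.billboard.com/archive/charts/1958/hot-100'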

Moving on, your code has some other problems:

def _init_(self):
    self.page_number=1958

If you want to use an init method, it should be named __init__ (two underscores on each side), and because you're extending Spider, you need to accept *args and **kwargs so you can call the parent constructor:

def __init__(self, *args, **kwargs):
    super(BillboardSpider, self).__init__(*args, **kwargs)
    self.page_number = 1958
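
If you're on Python 3, the zero-argument form of super() does the same thing:

def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.page_number = 1958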

That said, it sounds like you might be better off not using __init__ at all and instead just using a list comprehension to generate all the URLs from the get-go:

start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year) 
                  for year in range(1958, 2017)]

start_urls will then look like this:

['http://www.billboard.com/archive/charts/1958/hot-100',
 'http://www.billboard.com/archive/charts/1959/hot-100',
 'http://www.billboard.com/archive/charts/1960/hot-100',
 'http://www.billboard.com/archive/charts/1961/hot-100',
 ...
 'http://www.billboard.com/archive/charts/2016/hot-100']

You may also be populating your BillboardItem incorrectly. If BillboardItem is a plain Python object (rather than a scrapy Item with those fields declared), it doesn't support dict-style item assignment:

item = BillboardItem()
item['IssueDate'] = IssueDate
item['Song'] = Song
item['Artist'] = Artist

should be:

item = BillboardItem()
item.IssueDate = IssueDate
item.Song = Song
item.Artist = Artist

although it's generally better to just do that in the class' __init__:

class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist

and then create the item with item = BillboardItem(IssueDate, Song, Artist).

Updated

Anyway, I cleaned up your code (and created a BillboardItem as I don't exactly know how yours looks):

from scrapy import Spider, Item, Field
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()


class BillboardSpider(Spider):
    name = 'billboard'
    allowed_domains = ['billboard.com']
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
            for year in range(1958, 2017)]


    def parse(self, response):
        print(response.url)
        print("----------")

        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem(issue_date=issue_date, song=song, artist=artist)

            yield item

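Since you also mentioned wanting to stop once a year no longer has a table, here is one way you could sketch that: keep a single start URL, let parse queue the next year itself, and raise CloseSpider when no rows come back. This reuses the BillboardItem and the XPaths from above; the spider name, the year attribute, and BillboardYearSpider are just placeholder names I made up:

from scrapy import Spider
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.selector import Selector

URL = "http://www.billboard.com/archive/charts/{}/hot-100"


class BillboardYearSpider(Spider):
    name = 'billboard_by_year'
    allowed_domains = ['billboard.com']
    start_urls = [URL.format(1958)]
    year = 1958  # first year in the archive

    def parse(self, response):
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()

        # No table on this year's page: assume we've run past the archive and stop.
        if not rows:
            raise CloseSpider('no chart table found for {}'.format(self.year))

        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()
            yield BillboardItem(issue_date=issue_date, song=song, artist=artist)

        # Queue the next year only after this one has been parsed,
        # so the crawl proceeds year by year until the table disappears.
        self.year += 1
        yield Request(URL.format(self.year), callback=self.parse)

Note that this crawls sequentially (each year's request is only issued after the previous one is parsed), whereas the start_urls list above lets Scrapy fetch all the years concurrently.
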
Hope this helps. :)

Upvotes: 4
