Reputation: 188
I'm trying to scrape every year of the Billboard Hot 100 archive. I have a spider that works for one year at a time, but I want it to crawl through every year and gather that data as well. Here is my current code:
from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"


class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number=1958

    def parse(self, response):
        print self.page_number
        print "----------"
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist
            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)
But I'm getting this error:

    start_urls = [URL % 1958]
    ValueError: unsupported format character '/' (0x2f) at index 41

Any ideas? I want the code to change the year in the original "URL" link from 1958 to 1959 automatically, keep going year by year until it stops finding the table, and then close out.
Upvotes: 0
Views: 2927
Reputation: 508
The error you're getting is because you're not using the correct string-formatting syntax; the Python documentation on string formatting has the details on how it works. The reason it fails in your particular case is that your URL is missing an 's':
URL = "http://www.billboard.com/archive/charts/%/hot-100"
should be
URL = "http://www.billboard.com/archive/charts/%s/hot-100"
Anyway, it's better to use new-style string formatting:
URL = "http://www.billboard.com/archive/charts/{}/hot-100"
start_urls = [URL.format(1958)]
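A quick interpreter check (illustrative only) shows the difference between the broken and the fixed versions:

>>> "http://www.billboard.com/archive/charts/%/hot-100" % 1958
Traceback (most recent call last):
  ...
ValueError: unsupported format character '/' (0x2f) at index 41
>>> "http://www.billboard.com/archive/charts/%s/hot-100" % 1958
'http://www.billboard.com/archive/charts/1958/hot-100'
>>> "http://www.billboard.com/archive/charts/{}/hot-100".format(1958)
'http://www.billboard.com/archive/charts/1958/hot-100'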
Moving on, your code has some other problems:
def _init_(self):
    self.page_number=1958
If you want to use an init function, it should be named __init__ (two underscores on each side), and because you're extending Spider, you need to accept *args and **kwargs and pass them on to the parent constructor:
def __init__(self, *args, **kwargs):
    super(BillboardSpider, self).__init__(*args, **kwargs)
    self.page_number = 1958
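With that counter in place, parse could then request the next year itself and stop once a page has no table, which is what you asked for. A rough sketch of that idea (assuming URL is the %s-fixed version from above, Request and Selector are imported as in your original code, and BillboardItem looks like the one in the cleaned-up code further down):

def parse(self, response):
    rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
    if not rows:
        # no chart table on this year's page, so stop following further years
        return
    for row in rows:
        # one item per chart entry, using the same XPaths as in your code
        yield BillboardItem(
            issue_date=Selector(text=row).xpath('//td[1]/a/span/text()').extract(),
            song=Selector(text=row).xpath('//td[2]/text()').extract(),
            artist=Selector(text=row).xpath('//td[3]/a/text()').extract(),
        )
    # then move on to the next year
    self.page_number += 1
    yield Request(URL % self.page_number)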
That said, it sounds like you might be better off not using __init__ at all, and instead just using a list comprehension to generate all the URLs from the get-go:
start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
for year in range(1958, 2017)]
start_urls will then look like this:

['http://www.billboard.com/archive/charts/1958/hot-100',
 'http://www.billboard.com/archive/charts/1959/hot-100',
 'http://www.billboard.com/archive/charts/1960/hot-100',
 'http://www.billboard.com/archive/charts/1961/hot-100',
 ...
 'http://www.billboard.com/archive/charts/2017/hot-100']
You're also not necessarily populating your BillboardItem correctly: if BillboardItem is a plain Python class rather than a scrapy.Item, it won't support item assignment, so:
item = BillboardItem()
item['IssueDate'] = IssueDate
item['Song'] = Song
item['Artist'] = Artist
should be:
item = BillboardItem()
item.IssueDate = IssueDate
item.Song = Song
item.Artist = Artist
although it's generally better to just do that in the class's __init__ method:
class BillboardItem(object):
    def __init__(self, issue_date, song, artist):
        self.issue_date = issue_date
        self.song = song
        self.artist = artist
and then create the item with item = BillboardItem(IssueDate, Song, Artist).
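If, on the other hand, your BillboardItem is already a scrapy.Item (as in the cleaned-up code below), then the dict-style item['IssueDate'] = IssueDate you already had is the right syntax, and it's the attribute version that would fail:

# assuming a scrapy.Item with an issue_date field, as defined in the cleaned-up code below
item = BillboardItem()
item['issue_date'] = 'some issue date'   # dict-style assignment is how scrapy Items are populated
# item.issue_date = 'some issue date'    # would raise AttributeError on a scrapy.Item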
Anyway, I cleaned up your code (and wrote a BillboardItem definition, since I don't know exactly what yours looks like):
from scrapy import Spider, Item, Field
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class BillboardItem(Item):
    issue_date = Field()
    song = Field()
    artist = Field()


class BillboardSpider(Spider):
    name = 'billboard'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = ["http://www.billboard.com/archive/charts/{year}/hot-100".format(year=year)
                  for year in range(1958, 2018)]

    def parse(self, response):
        print(response.url)
        print("----------")
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
        for row in rows:
            issue_date = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            song = Selector(text=row).xpath('//td[2]/text()').extract()
            artist = Selector(text=row).xpath('//td[3]/a/text()').extract()

            item = BillboardItem(issue_date=issue_date, song=song, artist=artist)
            yield item
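To run it, either put the spider in a Scrapy project and use scrapy crawl billboard, or save it as a standalone file (the filename below is just an example) and run it with runspider, writing the scraped items to JSON:

scrapy runspider billboard_spider.py -o hot100.json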
Hope this helps. :)
Upvotes: 4