Scraping iTunes Charts using Scrapy

Question

I am following this tutorial on using Scrapy to scrape the iTunes charts: http://davidwalsh.name/python-scrape

The tutorial is slightly outdated: some of the classes it uses (e.g. HtmlXPathSelector, BaseSpider) have been deprecated in the current version of Scrapy. I have been trying to complete the tutorial with the current version, but without success.

If anyone can see what I'm doing incorrectly, I would love to understand what I need to fix.

items.py

from scrapy.item import Item, Field

class AppItem(Item):
    app_name = Field()
    category = Field()
    appstore_link = Field()
    img_src = Field()

apple_spider.py

import scrapy
from scrapy.selector import Selector

from apple.items import AppItem

class AppleSpider(scrapy.Spider):
    name = "apple"
    allowed_domains = ["apple.com"]
    start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]

    def parse(self, response):
        apps = response.selector.xpath('//*[@id="main"]/section/ul/li')
        count = 0
        items = []

        for app in apps:

            item = AppItem()
            item['app_name'] = app.select('//h3/a/text()')[count].extract()
            item['appstore_link'] = app.select('//h3/a/@href')[count].extract()
            item['category'] = app.select('//h4/a/text()')[count].extract()
            item['img_src'] = app.select('//a/img/@src')[count].extract()

            items.append(item)
            count += 1

        return items

This is the console output after running scrapy crawl apple:

2015-02-10 13:38:12-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: apple)
2015-02-10 13:38:12-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2015-02-10 13:38:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'apple.spiders', 'SPIDER_MODULES': ['apple.spiders'], 'BOT_NAME': 'apple'}
2015-02-10 13:38:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-10 13:38:13-0500 [scrapy] INFO: Enabled item pipelines:
2015-02-10 13:38:13-0500 [apple] INFO: Spider opened
2015-02-10 13:38:13-0500 [apple] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-10 13:38:13-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-10 13:38:13-0500 [apple] DEBUG: Crawled (200)  (referer: None)
2015-02-10 13:38:13-0500 [apple] INFO: Closing spider (finished)
2015-02-10 13:38:13-0500 [apple] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 236,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 13148,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 271000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 2, 10, 18, 38, 13, 240000)}
2015-02-10 13:38:13-0500 [apple] INFO: Spider closed (finished)

Thanks in advance for any help/advice!

alecxe · Accepted Answer

Before reading the technical part: make sure you are not violating the iTunes terms of use.

All of the problems you have are inside the parse() callback:

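The main XPath is not correct: there is no ul element directly under the section, so ul has to be matched as a descendant (//ul) rather than as a direct child (/ul).
Instead of response.selector.xpath() you can call response.xpath() directly.
The XPath expressions in the loop should be context-specific: start them with a dot (.//h3/a) so they search inside the current app node, and use xpath() instead of the deprecated select().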

The fixed version:

def parse(self, response):
    apps = response.xpath('//*[@id="main"]/section//ul/li')

    for app in apps:
        item = AppItem()
        item['app_name'] = app.xpath('.//h3/a/text()').extract()
        item['appstore_link'] = app.xpath('.//h3/a/@href').extract()
        item['category'] = app.xpath('.//h4/a/text()').extract()
        item['img_src'] = app.xpath('.//a/img/@src').extract()

        yield item
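
To see the two XPath points in isolation, here is a minimal sketch that runs without Scrapy or network access. It makes two assumptions: the HTML snippet is a made-up, simplified stand-in for the charts page (not the real markup), and the standard library's xml.etree.ElementTree replaces Scrapy's selectors, since it handles the same child-versus-descendant and relative-path cases for well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the charts markup (not the real page).
# Note the extra <div> between <section> and <ul>.
HTML = """
<div id="main">
  <section>
    <div>
      <ul>
        <li><h3><a href="/app/one">App One</a></h3><h4><a href="/genre/games">Games</a></h4></li>
        <li><h3><a href="/app/two">App Two</a></h3><h4><a href="/genre/music">Music</a></h4></li>
      </ul>
    </div>
  </section>
</div>
"""

root = ET.fromstring(HTML)  # root is the <div id="main"> element

# "section/ul/li" requires <ul> to be a direct child of <section>: no match here.
direct = root.findall("./section/ul/li")

# "section//ul/li" accepts <ul> at any depth below <section>.
nested = root.findall("./section//ul/li")

# A search started from the root sees every link in the document ...
all_links = root.findall(".//h3/a")

# ... while a search started from each <li> only sees that row's own nodes.
rows = [
    {
        "app_name": li.find(".//h3/a").text,
        "appstore_link": li.find(".//h3/a").get("href"),
        "category": li.find(".//h4/a").text,
    }
    for li in nested
]

print(len(direct), len(nested), len(all_links))
for row in rows:
    print(row)
```

In the spider itself the same idea applies: app.xpath('.//h3/a/text()') is evaluated relative to the current li, so no manual counter is needed to pick the right row.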
