Eric

Reputation: 21

Scrapy issue with iTunes' AppStore

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8

In the following code I have used the simplest regex which targets all apps in the US store.

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class AppStoreSpider(CrawlSpider):
    domain_name = 'itunes.apple.com'
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'itunes\.apple\.com/us/app'),
            'parse_app', follow=True,
        ),
    )

    def parse_app(self, response):
        ....

SPIDER = AppStoreSpider()
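
A quick sanity check of the pattern outside Scrapy, as a minimal sketch. It assumes the link extractor applies the `allow` pattern as a search anywhere in the URL rather than as a full match; the two URLs are the ones that appear in the log output.

```python
import re

# Same pattern as in the Rule above.
allow = r'itunes\.apple\.com/us/app'

app_url = 'http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8'
genre_url = 'http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8'

# The app page should match the pattern; the genre listing should not.
matches_app = re.search(allow, app_url) is not None
matches_genre = re.search(allow, genre_url) is not None
```

If `matches_app` is true, the rule itself does select the app link, so the request is being dropped at a later stage than link extraction.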

When I run it I receive the following:

 [itunes.apple.com] DEBUG: Crawled (200) <GET http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8> (referer: None)
 [itunes.apple.com] DEBUG: Filtered offsite request to 'itunes.apple.com': <GET http://itunes.apple.com/us/app/bloomberg/id281941097?mt=8>

As you can see, as soon as it crawls the first page it logs "Filtered offsite request to 'itunes.apple.com'" and then the spider stops. It also prints this warning:

[ScrapyHTTPPageGetter,client] /usr/lib/python2.5/cookielib.py:1577: exceptions.UserWarning: cookielib bug!

Traceback (most recent call last):
  File "/usr/lib/python2.5/cookielib.py", line 1575, in make_cookies
    parse_ns_headers(ns_hdrs), request)
  File "/usr/lib/python2.5/cookielib.py", line 1532, in _cookies_from_attrs_set
    cookie = self._cookie_from_cookie_tuple(tup, request)
  File "/usr/lib/python2.5/cookielib.py", line 1451, in _cookie_from_cookie_tuple
    if version is not None: version = int(version)
ValueError: invalid literal for int() with base 10: '"1"'

I have used the same script for other websites and I didn't have this problem.

Any suggestions?

Upvotes: 2

Views: 1335

Answers (2)

Inn0vative1

Reputation: 2145

I see this post is pretty old, but if you haven't figured out the cause yet, here it is.

I ran into a similar issue working with itunesconnect using mechanize. After much frustration I found that there's a bug in cookielib: it doesn't handle some cookies correctly. It's discussed here: http://bugs.python.org/issue3924

The fix at the bottom of that post worked for me. I'll repost here for convenience.

Basically, you create a custom subclass of cookielib.CookieJar, override _cookie_from_cookie_tuple, and use this CustomCookieJar in place of the cookielib jar:

import cookielib

class CustomCookieJar(cookielib.CookieJar):
    def _cookie_from_cookie_tuple(self, tup, request):
        # tup is (name, value, standard, rest); standard holds the standard cookie attributes.
        name, value, standard, rest = tup
        version = standard.get("version", None)
        if version is not None:
            # Some servers add " around the version number, this module expects a pure int.
            standard["version"] = version.strip('"')
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)
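
To make the failure and the fix concrete, here is a minimal, self-contained sketch. It repeats the class so it runs on its own, and the `try`/`except` import is my assumption so that it also loads on Python 3, where the module is named `http.cookiejar`. The opener at the end is one way to put the jar to use with the standard library; it is an illustration, not how Scrapy wires up its cookies.

```python
try:
    import cookielib                      # Python 2 module name, as in the answer
    import urllib2 as urlrequest
except ImportError:
    import http.cookiejar as cookielib    # Python 3 name for the same module
    import urllib.request as urlrequest


class CustomCookieJar(cookielib.CookieJar):
    def _cookie_from_cookie_tuple(self, tup, request):
        name, value, standard, rest = tup
        version = standard.get("version", None)
        if version is not None:
            # Drop the surrounding quotes before the parent class calls int().
            standard["version"] = version.strip('"')
        return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)


# The value from the traceback: the quotes are part of the string itself.
raw_version = '"1"'
try:
    int(raw_version)
    raw_parses = True
except ValueError:
    raw_parses = False

# After stripping the quotes the same value converts cleanly.
clean_version = int(raw_version.strip('"'))

# Hand the custom jar to an opener instead of a plain cookielib.CookieJar.
jar = CustomCookieJar()
opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
```

With mechanize, as far as I remember, the jar is attached with `set_cookiejar(jar)` on the browser object instead of building an opener.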

Upvotes: 1

ChrisCast

Reputation: 208

When I hit that link in a browser, it automatically tries to open iTunes locally. That could be the "offsite request" mentioned in the error.

I would try:

1) Remove "?mt=8" from the end of the URL. It looks like it's not needed anyway and it could have something to do with the request.

2) Try the same request in the Scrapy Shell. It's a much easier way to debug your code and try new things. More details here: http://doc.scrapy.org/topics/shell.html?highlight=interactive
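
For step 1, a small sketch of dropping the query string in code rather than by hand. `drop_query` is a made-up helper name, not part of Scrapy, and the fallback import is only there so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlsplit, urlunsplit      # Python 2
except ImportError:
    from urllib.parse import urlsplit, urlunsplit  # Python 3


def drop_query(url):
    """Return url without its query string or fragment (hypothetical helper)."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))


start_url = "http://itunes.apple.com/us/genre/mobile-software-applications/id6015?mt=8"
bare_url = drop_query(start_url)
```

The resulting `bare_url` can then be tried as the start URL to see whether the "?mt=8" parameter makes any difference.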

Upvotes: 1
