Scrapy using SgmlLinkExtractor

Question

I am trying to crawl pages of the form http://www.wynk.in/music/song/variable_underscored_alphanumeric_string.html. I want to request these URLs from my laptop, but since they only work in apps and on WAP (mobile) browsers, I have set the user agent in settings.py to 'Mozilla/5.0 (Linux; U; Android 2.3.4; fr-fr; HTC Desire Build/GRJ22) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'. My spider code is:

from scrapy import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from wynks.items import WynksItem


class MySpider(CrawlSpider):
    name = "wynk"
    #allowed_domains = ["wynk.in"]
    start_urls = ["http://www.wynk.in/", ]
    #start_urls = []
    # Extract only links that look like song pages and send them to parse_item.
    rules = (Rule(SgmlLinkExtractor(allow=[r'/music/song/\w+\.html']), callback='parse_item', follow=True),)

    def parse_item(self, response):
        hxs = Selector(response)
        if hxs:
            # Song details are laid out in table cells inside div.songDetails.
            tds = hxs.xpath("//div[@class='songDetails']//tr//td")
            if tds:
                for td in tds.xpath('.//div'):
                    titles = td.xpath("a/text()").extract()
                    if titles:
                        for title in titles:
                            print title

I start the code by running scrapy crawl wynk -o abcd.csv -t csv

However, this is the only output I get:

Crawled (200) <GET http://www.wynk.in/> (referer: None)
2015-03-23 11:06:04+0530 [wynk] INFO: Closing spider (finished)

What am I doing wrong?

notrai · Accepted Answer

The homepage has no direct link to a URL of that form, so the link extractor finds nothing to follow there. I worked around this by fetching all links on each page and recursively issuing requests for them until the music/song pages are reached. I also changed the spider to inherit from Spider instead of CrawlSpider.
