Gio

Reputation: 359

How do I crawl into each one of the hrefs using Scrapy?

How can I use Scrapy to crawl into each one of the hrefs? I only know how to display them all, but I want to be able to go into each of those links. (This is our intranet data, so you won't be able to access the links.) Also, how can I format the date when the data gets written to a file? Do I need to add a list of URLs to start_urls? Do I need to change my InitSpider to a CrawlSpider?

<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>

This is what I have so far; it prints everything:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector

from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
                callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, bad password :(\n\n\n")
            # Something went wrong; we couldn't log in, so nothing happens.

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item

Output I currently get in the CSV file:

14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors

Desired output in the CSV file:

14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors

Upvotes: 2

Views: 700

Answers (1)

nizam.sp

Reputation: 4062

Whenever you want to follow a URL, you can yield a Request object, e.g.:

    yield Request(extracted_url_link, callback=your_parse_function)
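
Applied to your spider, a minimal sketch of how parse could follow each extracted href (parse_detail is a hypothetical callback name, and the relative hrefs are made absolute with urljoin before being requested):

    from urlparse import urljoin  # Python 2 stdlib

    # inside your CarrierSpider class:
    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
            # follow each href found in this cell
            for href in column.select('.//@href').extract():
                yield Request(urljoin(response.url, href),
                              callback=self.parse_detail)

    def parse_detail(self, response):
        # hypothetical callback: extract whatever you need from the linked page
        self.log("Visited %s" % response.url)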

Look at the second example at the following link.
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example

Another way to specify which URLs to crawl is to use SgmlLinkExtractor with a CrawlSpider. You write rules, and the spider will follow every URL on a page that matches a rule. Refer to the example at the following URL.
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
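
A minimal sketch of that approach (the allow pattern is an assumption based on the packages.jsp links in your data, and you would still need to handle the login, for example by overriding start_requests):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class CarrierCrawlSpider(CrawlSpider):
        name = 'dis_crawl'
        allowed_domains = ['qvpweb01.ciq.labs.att.com']
        start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

        rules = (
            # follow every link whose URL matches packages.jsp
            # and pass each response to parse_package
            Rule(SgmlLinkExtractor(allow=(r'packages\.jsp', )),
                 callback='parse_package', follow=True),
        )

        def parse_package(self, response):
            # extract items from the linked page here
            self.log("Visited %s" % response.url)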

After the crawl, the date is just a string; you can convert it to a Python datetime object and then display it any way you want using rendering functions like strftime.
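
For example, a small sketch (the format strings are assumptions based on the timestamps in your sample data):

    from datetime import datetime

    raw = "2013-04-23 02:00:32.243"
    # parse the string into a datetime object
    dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
    # render it in whatever format you need
    print dt.strftime("%Y-%m-%d %H:%M:%S")  # drops the milliseconds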

Hope I answered your question.

Upvotes: 1
