Reputation: 359
How can I crawl to each one of the hrefs using Scrapy? I only know how to display them all, but I want to be able to go into each of those links. This is our intranet data, so you won't be able to access the links. Also, how can I format the date when the data gets displayed in a file? Do I need to add a list of URLs to start_urls? Do I need to change my InitSpider to a CrawlSpider?
<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>
This is what I have so far; it prints everything:
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import XmlXPathSelector
from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in."""
        if "logout" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("\n\n\nFailed, bad password :(\n\n\n")
            # Something went wrong; we couldn't log in, so nothing happens.

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        columns = xhs.select('//table[3]/row/cell')
        for column in columns:
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
Output I get from the CSV file below:
14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors
Desired output from the CSV that I would like to get below:
14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors
Upvotes: 2
Views: 700
Reputation: 4062
Whenever you want to follow a URL, you can yield a Request object.
E.g.: yield Request(extracted_url_link, callback=your_parse_function)
Look at the second example at the following link:
http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example
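Applied to the parse method in your spider, that could look roughly like this (a minimal sketch: parse_detail is a hypothetical callback name, and urljoin is needed here because the hrefs in your XML are relative):

from urlparse import urljoin  # Python 2 stdlib; add to the imports above

class CarrierSpider(InitSpider):
    # ... same name, login and init_request setup as in your spider ...

    def parse(self, response):
        xhs = XmlXPathSelector(response)
        for column in xhs.select('//table[3]/row/cell'):
            item = CarrierItem()
            item['title'] = column.select('.//text()').extract()
            item['link'] = column.select('.//@href').extract()
            yield item
            # Follow each extracted href; join it against the current
            # page URL first, since the hrefs are relative.
            for href in column.select('.//@href').extract():
                yield Request(urljoin(response.url, href),
                              callback=self.parse_detail)

    def parse_detail(self, response):
        # Hypothetical callback: parse each linked page here.
        self.log("Visited %s" % response.url)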
One more way to specify which URLs to crawl is to use an SgmlLinkExtractor with a CrawlSpider. You can write rules, and the spider will follow every URL on a page that matches one of them. Refer to the example at the following URL:
http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
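A rough sketch of the rule mechanism using the URLs from your spider (note this omits your login flow, which CrawlSpider does not handle out of the box, and the allow pattern is an assumption about which links you want followed):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class CarrierCrawlSpider(CrawlSpider):
    name = 'dis_crawl'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    # Follow every link whose URL matches the pattern and hand each
    # response to parse_item; follow=True keeps crawling from there.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'packages\.jsp', )),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("Crawled %s" % response.url)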
The date is just a string after the crawl; you can convert it to a Python datetime object and then display it however you want using datetime rendering functions like strftime.
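For the timestamps in your data, assuming they always come in the '2013-04-23 02:00:32.243' form, the conversion looks like this:

from datetime import datetime

raw = '2013-04-23 02:00:32.243'
# Parse the crawled string into a datetime object...
dt = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S.%f')
# ...then render it in whatever format you need.
print dt.strftime('%d %b %Y %H:%M:%S')   # prints: 23 Apr 2013 02:00:32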
Hope I answered your question.
Upvotes: 1