Recursive Scraping with Python and Scrapy: Information Not Retrieved

Question

I am trying to use Scrapy to pull contact information from the Pratt website, but information is not being retrieved. My code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector, Selector
from scrapy.http import Request

class ESpider(CrawlSpider):
    name = "pratt"
    allowed_domains = ["pratt.edu"]
    start_urls = ["https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/div[3]/div/div[2]/div/div/p/a',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        contacts = Selector(response)
        print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/h3').extract()
        print contacts.xpath('/html/body/div[3]/div/div[2]/table/tbody/tr[2]/td[2]/a').extract()

Starting from the page in start_urls, I want to follow each person's link and grab their name and email address from the linked page. When I run my scraper, I receive the following output:

2014-03-10 16:46:37-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: emailSpider)
2014-03-10 16:46:37-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-03-10 16:46:37-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'emailSpider.spiders', 'SPIDER_MODULES': ['emailSpider.spiders'], 'DEPTH_LIMIT': 1, 'BOT_NAME': 'emailSpider'}
2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-03-10 16:46:37-0400 [scrapy] INFO: Enabled item pipelines: 
2014-03-10 16:46:37-0400 [pratt] INFO: Spider opened
2014-03-10 16:46:37-0400 [pratt] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-03-10 16:46:37-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-10 16:46:37-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-10 16:46:41-0400 [pratt] DEBUG: Crawled (200)  (referer: None)
2014-03-10 16:46:44-0400 [pratt] DEBUG: Crawled (200)  (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
[]
[]
2014-03-10 16:46:47-0400 [pratt] DEBUG: Crawled (200)  (referer: https://www.pratt.edu/academics/architecture/ug_dept_architecture/faculty_and_staff/?id=01302)
[]
[]

(I halted the program after a few iterations.) It looks like the pages are being crawled, but both XPath queries return empty lists. Any idea why this is happening? Thanks very much.


Answers (1)
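
One frequent cause of empty results like these, offered as an assumption and not a confirmed diagnosis of the Pratt pages: XPaths copied from a browser's inspector often contain `tbody`, which browsers add when rendering a table even if the raw HTML has none. Scrapy sees the raw HTML, so such a path may match nothing. The sketch below illustrates the effect with the standard library's `xml.etree.ElementTree` instead of Scrapy so that it runs without a crawl. The markup, the name, and the email address are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a faculty page as the server sends it.
# Note that the table has no <tbody>; a browser would add one when rendering.
RAW_HTML = """
<html><body>
  <div>
    <table>
      <tr><td>Photo</td><td><h3>Header</h3></td></tr>
      <tr>
        <td>Photo</td>
        <td><h3>Jane Doe</h3><a href="mailto:jdoe@example.edu">jdoe@example.edu</a></td>
      </tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# Path in the style a browser inspector produces: includes tbody, matches nothing.
browser_style = texts("./body/div/table/tbody/tr[2]/td[2]/h3")

# Same path without tbody: matches the name in the raw markup.
without_tbody = texts("./body/div/table/tr[2]/td[2]/h3")

# Shorter relative path: less tied to the exact nesting of the page.
relative = texts(".//tr[2]/td[2]/a")

print(browser_style)
print(without_tbody)
print(relative)
```

In the spider from the question, the analogous change would be to drop `tbody` and prefer a relative expression, for example `contacts.xpath('//table//tr[2]/td[2]/h3/text()').extract()`. Whether that expression matches the real page is untested here; running `scrapy shell` against one of the faculty URLs is a quick way to try expressions against the HTML that Scrapy actually downloads.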
