Reputation: 165
I'm trying to get companies' info from a government website using Scrapy. My spider code is the following:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from ..items import CompaniesHouseItem

class SpendolaterSpider(scrapy.Spider):
    name = 'spendolater'
    allowed_domains = ['beta.companieshouse.gov.uk']
    start_url = ['https://beta.companieshouse.gov.uk/company/10511127']
    custom_settings = {"DOWNLOAD_DELAY": 1,}

    def crawling(self, response):
        domain = "https://beta.companieshouse.gov.uk/company/"
        for url in response.css("a::attr('href')").extract():
            if not url.startswith('https://'):
                continue
            if domain not in url:
                yield scrapy.Request(url, callback=self.parse)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_item(self, response):
        for contents in response.xpath('//*[@id="page-container"]'):
            item = CompaniesHouseItem()
            item["name"] = response.xpath('//*[@id="company-name"]').extract()
            item["location"] = response.xpath('//*[@id="content-container"]/dl/dd').extract()
            item['foundation'] = response.xpath('//*[@id="company-creation-date"]').extract()
            items['type'] = response.xpath('//*[@id="company-type"]').extract()
            items['SIC'] = response.xpath('//*[@id="sic0"]').extract()
            yield item
It doesn't show any errors when run, but it doesn't extract any info; the message "Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" appears in the command line.
The items.py file is as follows:
import scrapy

class CompaniesHouseItem(scrapy.Item):
    name = scrapy.Field()
    location = scrapy.Field()
    foundation = scrapy.Field()
    type = scrapy.Field()
    SIC = scrapy.Field()
The output is as follows.
2018-03-14 17:51:56 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: companies_house)
2018-03-14 17:51:56 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:04:45) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g 2 Nov 2017), cryptography 2.1.4, Platform Windows-10-10.0.16299-SP0
2018-03-14 17:51:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'companies_house', 'DOWNLOAD_DELAY': 1, 'NEWSPIDER_MODULE': 'companies_house.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['companies_house.spiders']}
2018-03-14 17:51:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-03-14 17:51:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-14 17:51:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-14 17:51:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-03-14 17:51:57 [scrapy.core.engine] INFO: Spider opened
2018-03-14 17:51:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-14 17:51:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-14 17:51:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-14 17:51:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 14, 8, 51, 57, 239817),
'log_count/DEBUG': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2018, 3, 14, 8, 51, 57, 231826)}
2018-03-14 17:51:57 [scrapy.core.engine] INFO: Spider closed (finished)
Any advice would be highly appreciated. Thanks in advance.
Upvotes: 0
Views: 1991
Reputation: 21221
You don't define a start_requests(self) method, only a start_url attribute. By default Scrapy reads its start URLs from the start_urls list (note the trailing s; your start_url is silently ignored) and sends each response to the parse callback. In other words, you are missing def parse(self, response): change def crawling(self, response) to def parse(self, response).
Also, your code's logic is wrong; think through the flow of the code before writing it. Put the listing page, i.e. the page that contains the company links, in start_urls. Then create def parse(self, response) with a for loop that iterates over each company link, as sketched below.
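A minimal sketch of that structure, reusing the asker's item fields. The search URL in start_urls is a hypothetical placeholder (not taken from the question), and the '/company/' link filter and the /text() selectors are assumptions you would adapt to the real pages:

import scrapy
from ..items import CompaniesHouseItem

class SpendolaterSpider(scrapy.Spider):
    name = 'spendolater'
    allowed_domains = ['beta.companieshouse.gov.uk']
    # start_urls (plural) is the attribute Scrapy reads by default.
    # Hypothetical listing page; replace with a real page that links to companies.
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=example']
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def parse(self, response):
        # Iterate over every link on the listing page and follow
        # only the ones that point at a company profile.
        for href in response.css('a::attr(href)').extract():
            if '/company/' in href:
                # response.follow resolves relative URLs (Scrapy 1.4+).
                yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # One company profile page -> one item.
        item = CompaniesHouseItem()
        item['name'] = response.xpath('//*[@id="company-name"]/text()').extract_first()
        item['location'] = response.xpath('//*[@id="content-container"]/dl/dd/text()').extract()
        item['foundation'] = response.xpath('//*[@id="company-creation-date"]/text()').extract_first()
        item['type'] = response.xpath('//*[@id="company-type"]/text()').extract_first()
        item['SIC'] = response.xpath('//*[@id="sic0"]/text()').extract_first()
        yield item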
Upvotes: 0
Reputation: 1908
Scrapy, by default, reads the first addresses to scrape from start_urls (not start_url) and starts parsing with the parse method (not crawling). Rename both and relaunch your spider.
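Concretely, the two renames in the question's spider, with the method body left unchanged:

class SpendolaterSpider(scrapy.Spider):
    name = 'spendolater'
    allowed_domains = ['beta.companieshouse.gov.uk']
    start_urls = ['https://beta.companieshouse.gov.uk/company/10511127']  # was start_url

    def parse(self, response):  # was crawling
        # ... same body as before ...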
Upvotes: 2