Reputation: 154
I am trying to scrape some information from the UK's Companies House website using Scrapy. I connected to the website through the Scrapy shell with the command
scrapy shell https://beta.companieshouse.gov.uk/search?q=a
and with
response.xpath('//*[@id="results"]').extract()
I managed to get the results back.
I tried to put this into a program so I could export the results to CSV or JSON, but I am having trouble getting it to work. This is what I have:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
It is very simple, but I have tried a lot of things. Any insight would be appreciated!
Upvotes: 0
Views: 65
Reputation: 121
Try this:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)
Upvotes: 0
Reputation: 28266
These lines of code are the problem:
def start_requests(self):
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']
The start_requests method should return an iterable of Request objects; yours returns None.
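To see why the original method returns None, here is a small sketch in plain Python with no Scrapy involved. The function names broken_start_requests and working_start_requests are made up for illustration; the point is only that assigning to a local variable produces no return value, while a generator function gives the caller something to iterate over.

```python
URL = "https://beta.companieshouse.gov.uk/search?q=a"


def broken_start_requests():
    # Only binds a local name. Nothing is returned or yielded,
    # so the caller receives None.
    start_urls = [URL]


def working_start_requests():
    # The yield makes this a generator function, so calling it
    # returns an iterable the caller can loop over.
    for url in [URL]:
        yield url


print(broken_start_requests())
print(list(working_start_requests()))
```

In the spider, Scrapy is the caller, so it ends up with None instead of something it can loop over.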
The default start_requests creates this iterable from the URLs specified in start_urls, so simply defining that as a class variable (outside of any function) and not overriding start_requests will work as you want.
Upvotes: 2