Reputation: 4274
I've used Selenium's Firefox driver to load and scrape web pages in some of the spiders in my Scrapy project.
The problem:
Selenium launches a Firefox instance when I run any of my spiders, even those in which I've never imported webdriver or called webdriver.Firefox().
Expected behavior:
Selenium should launch a Firefox instance only when I run spiders that actually call webdriver.Firefox().
Why is this important?
I quit the Firefox instance once a spider is done, but that obviously never happens in spiders that don't use Selenium, so the browser keeps running.
A spider that doesn't use Selenium
This spider doesn't use Selenium, so I expect it not to launch Firefox.
import scrapy

class MySpider(scrapy.Spider):
    name = "MySpider"
    domain = 'www.example.com'
    allowed_domains = ['example.com']  # domains only, no scheme
    start_urls = ['http://example.com']

    def parse(self, response):
        for sel in response.css('.main-content'):
            # Article is a scrapy.Item subclass
            item = Article()
            item['title'] = sel.css('h1::text').extract()[0]
            item['body'] = sel.css('p::text').extract()[0]
            yield item
Upvotes: 2
Views: 113
Reputation: 4274
The issue was actually in how I was instantiating the webdriver.Firefox class in the spiders that were meant to use Selenium:
import scrapy
from selenium import webdriver

class MySpider(scrapy.Spider):
    # basic Scrapy settings here
    driver = webdriver.Firefox()  # class attribute: evaluated at import time

    def parse(self, response):
        self.driver.get(response.url)
        result = scrapy.Selector(text=self.driver.page_source)
        # scrape and yield items to the pipeline
        # then, under a certain condition:
        self.driver.quit()
Why was this happening?
When you run any Scrapy command, Python imports the project's modules and executes every spider class body it finds. So no matter which spider I tried to run, Selenium launched a new webdriver.Firefox instance for every spider class containing that line.
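To illustrate (a minimal sketch, independent of Scrapy and Selenium; the module and class names are made up): statements in a class body run as soon as the module is imported, not when the class is instantiated.

# demo.py
class Demo:
    # This print runs immediately on import,
    # even if Demo() is never instantiated.
    print("class body executed")

# $ python -c "import demo"
# class body executed

This is exactly what happened with driver = webdriver.Firefox() written as a class attribute.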
Solution
I just moved the webdriver instantiation into the spider's __init__ method (calling super().__init__ as well, so Scrapy's own initialization still runs):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
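For completeness, here's how the whole pattern could look. This is a sketch rather than the answer's exact code: the spider name and URL are placeholders, and quitting the driver in Scrapy's closed() hook (called once when the spider finishes) is my addition.

import scrapy
from selenium import webdriver

class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"  # hypothetical name
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Created only when the crawler instantiates this spider,
        # not at import time.
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        sel = scrapy.Selector(text=self.driver.page_source)
        # ... extract and yield items from sel here ...

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes,
        # so the browser is quit exactly once.
        self.driver.quit()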
Upvotes: 2