Reputation: 155
Trying to launch Scrapy from a .py file with this command:
py myproject.py -f C:\Users\admin\Downloads\test.csv
Here is my file, named "myproject.py":
import argparse
import spiders.ggspider as MySpiders

# argument parsing reconstructed from the command line above
parser = argparse.ArgumentParser()
parser.add_argument('-f', '--file')
args = parser.parse_args()

dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)  # returns an array (CSVReader import not shown)

leSpider = MySpiders.GGSpider()
leSpider.myList = myData
leSpider.start_requests()
Here is my spider file:
import scrapy
import urllib.parse

class GGSpider(scrapy.Spider):
    name = "spiderman"
    domain = "https://www.google.fr/?q={}"
    myList = []

    def __init__(self):
        pass

    def start_requests(self):
        for leObject in self.myList:
            # note: tmpURL is built here but never used by the request below
            tmpURL = self.domain.format(urllib.parse.urlencode({'text': leObject[0]}))
            yield scrapy.Request(url=self.domain + leObject[0], callback=self.parse)

    def parse(self, response):
        print('hello')
        print(response)
My problem is: I do get into start_requests (I put a print before the yield and saw it in the console), but the callback does not seem to happen (I never get the 'hello' print).
I really don't know why (I'm new to Python, maybe I'm missing something obvious).
Upvotes: 0
Views: 354
Reputation: 1379
I guess that's because a generator doesn't actually run until you retrieve its values. You could try to consume the generator somehow:
import spiders.ggspider as MySpiders

# getAnimalList returns an array
dataFile = args.file
myData = CSVReader.getAnimalList(dataFile)

leSpider = MySpiders.GGSpider()
leSpider.myList = myData
for request in leSpider.start_requests():
    do_something(request)
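As a minimal sketch of that laziness (plain Python, not from the original post): a generator's body does not execute until something iterates it.

def gen():
    print('inside')  # runs only once the generator is consumed
    yield 1

g = gen()   # nothing is printed yet: calling gen() only creates the generator object
next(g)     # now 'inside' is printed

Note, though, that consuming start_requests() by hand only builds scrapy.Request objects; nothing downloads them, so parse() would still never fire. Letting Scrapy's engine schedule the requests, as below, is the proper fix.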
UPD: Here is a better example of running a spider from a script:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
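For this question specifically, a hedged sketch of wiring the CSV data in: CrawlerProcess.crawl() forwards keyword arguments to the spider, so myList could be passed that way (this assumes GGSpider's empty __init__ is removed, or forwards **kwargs to super().__init__(), so the base Spider can apply them):

from scrapy.crawler import CrawlerProcess
import spiders.ggspider as MySpiders

# myData built from the CSV exactly as in the question
process = CrawlerProcess()
process.crawl(MySpiders.GGSpider, myList=myData)  # kwargs reach the spider instance
process.start()  # the engine downloads the requests, so parse() is actually called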
Upvotes: 1