Reputation: 23
I have two spiders, say A and B. A scrapes a bunch of URLs and writes them to a CSV file, and B scrapes inside those URLs, reading from the CSV file generated by A. But B throws a FileNotFoundError before A has actually created the file. How can I make my spiders behave so that B waits until A comes back with the URLs? Any other solution would be helpful.
WriteToCsv.py file
import csv

def write_to_csv(item):
    with open('urls.csv', 'a', newline='') as csvfile:
        fieldnames = ['url']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writerow({'url': item})

class WriteToCsv(object):
    def process_item(self, item, spider):
        if item['url']:
            write_to_csv("http://pypi.org" + item["url"])
        return item
Pipelines.py file
ITEM_PIPELINES = {
'PyPi.WriteToCsv.WriteToCsv': 100,
'PyPi.pipelines.PypiPipeline': 300,
}
read_csv method
import csv

def read_csv():
    x = []
    with open('urls.csv', 'r') as csvFile:
        reader = csv.reader(csvFile)
        x = [''.join(row) for row in reader]
    return x
start_urls in B spider file
start_urls = read_csv() #Error here
Upvotes: 0
Views: 761
Reputation: 378
I would consider using a single spider with two methods, parse and final_parse. As far as I can tell from the context you have provided, there is no need to write the URLs to disk.
parse should contain the logic for scraping the URLs that spider A is currently writing to the csv, and should return a new request with a callback to the final_parse method.
def parse(self, response):
    url = do_something(response.body_as_unicode())
    return scrapy.Request(url, callback=self.final_parse)
final_parse should then contain the parsing logic that was previously in spider B.
def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    return item
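Putting the pieces together, a minimal single-spider sketch could look like the following. The spider name, start URL, and selectors are assumptions for illustration only, not taken from your project:
import scrapy

class PypiSpider(scrapy.Spider):
    # Hypothetical name and start URL; replace with your own listing page.
    name = "pypi_combined"
    start_urls = ["https://pypi.org/simple/"]

    def parse(self, response):
        # Follow each link directly instead of writing it to a csv first.
        for href in response.css("a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.final_parse)

    def final_parse(self, response):
        # The parsing logic that previously lived in spider B (fields are placeholders).
        yield {
            "url": response.url,
            "title": response.css("h1::text").extract_first(),
        }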
Note: If you need to pass any additional information from parse to final_parse, you can use the meta argument of scrapy.Request.
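For example, a small sketch of the meta mechanism, reusing the do_something placeholder from above (the 'listing_url' key is just an illustrative name):
def parse(self, response):
    url = do_something(response.body_as_unicode())
    # Attach extra data to the request; it is available again in the callback.
    return scrapy.Request(url, callback=self.final_parse,
                          meta={'listing_url': response.url})

def final_parse(self, response):
    # Read back the value that parse attached to the request.
    listing_url = response.meta['listing_url']
    ...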
If you do need the URLs, you could add them as a field on your item; each one can be accessed with response.url.
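For instance, assuming do_something_else returns a dict-like item, recording the URL could look like this:
def final_parse(self, response):
    item = do_something_else(response.body_as_unicode())
    # Store the URL of the page that was scraped on the item itself.
    item['url'] = response.url
    return item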
Upvotes: 1