Reputation: 2932
I'm writing a scrapy web crawler that saves the HTML from the pages I visit, and I'm uploading it to S3. Since the pages are uploaded to S3, there's no point in keeping a local copy.
Spider class
class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    rules = (
        Rule(LinkExtractor(allow=()), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body
        return item
pipelines.py
import os

save_path = 'My path'

if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
        return item  # pipelines should return the item so later stages still receive it

    def UploadtoS3(self, filename):
        ...
I read in the Python docs that I can create a NamedTemporaryFile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
I'm a little fuzzy on when it gets deleted. If I were to use a NamedTemporaryFile, how could I delete the file after a successful upload to S3?
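From the docs, my understanding is that with the default delete=True the file is removed as soon as it is closed. The rough idea I have is a sketch like the one below, using delete=False and removing the file myself, and assuming my existing UploadtoS3 helper were changed to take a file path instead of a filename:

import os
import tempfile

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        # delete=False keeps the file on disk after the with-block closes it
        with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
            f.write(item['html'])
            temp_path = f.name
        try:
            self.UploadtoS3(temp_path)  # assumes the helper is reworked to accept a path
        finally:
            os.remove(temp_path)  # clean up the temp file once the upload attempt is done
        return item

Is something like that the right way to do it, or is there a cleaner option?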
Upvotes: 0
Views: 1140
Reputation: 852
Extending on my comment:
You could use an in-memory buffer (io.BytesIO here, since response.body is bytes; io.StringIO is the text equivalent) instead of saving, reading and deleting a file.
It would be something like this:
import io

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        file = io.BytesIO()        # in-memory buffer, nothing is written to disk
        file.write(item['html'])   # response.body is bytes, so BytesIO takes it directly
        file.seek(0)               # rewind so the upload reads from the beginning
        self.UploadtoS3(filename, file)
        return item

    def UploadtoS3(self, filename, file):
        # here, instead of reading a file from disk, upload from the buffer passed in
        ...
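For the upload itself, something along these lines should work with boto3 (the bucket name 'my-bucket' is just a placeholder; upload_fileobj expects a binary file-like object, which is another reason to use io.BytesIO):

import boto3

class HtmlFilePipeline(object):
    # ... process_item as above ...
    def UploadtoS3(self, filename, file):
        # upload_fileobj streams the file-like object straight to S3; no local copy is made
        s3 = boto3.client('s3')
        s3.upload_fileobj(file, 'my-bucket', filename)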
Documentation: https://docs.python.org/3/library/io.html
Upvotes: 3