Bhetzie

Reputation: 2932

Scrapy save html as temporary file

I'm writing a Scrapy web crawler that saves the HTML of each page I visit and uploads it to S3. Since the pages are uploaded to S3, there's no point in keeping a local copy.

Spider class

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body
        return item

pipelines.py

import os

save_path = 'My path'

if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
        return item  # Scrapy expects process_item to return the item

    def UploadtoS3(self, filename):
        ...

I read in the Python docs that I can create a NamedTemporaryFile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile

I'm a little fuzzy on when it gets deleted. If I were to use a NamedTemporaryFile, how could I delete the file after a successful upload to S3?

Upvotes: 0

Views: 1140

Answers (1)

Henrique Coura

Reputation: 852

Expanding on my comment:

You could use io.BytesIO to create an in-memory binary buffer instead of saving, reading, and deleting a file. (response.body is bytes, so io.StringIO, which only accepts str, would raise a TypeError on write.)

It would be something like this:

import io

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        # response.body is bytes, so use a binary in-memory buffer
        file = io.BytesIO()
        file.write(item['html'])
        file.seek(0)  # rewind so the upload reads from the start
        self.UploadtoS3(filename, file)
        return item  # Scrapy expects process_item to return the item

    def UploadtoS3(self, filename, file):
        # here instead of reading the file to upload to S3, use the file passed to the method
        ...
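To make the buffer idea concrete, here is a minimal sketch that runs without Scrapy or S3. The names `buffer_html` and `upload_stub` are made up for illustration, and `upload_stub` only reads the buffer back; it is not a real S3 call. It uses a binary buffer because `response.body` is bytes.

```python
import io


def upload_stub(filename, fileobj):
    """Hypothetical stand-in for the real S3 upload: it only reads the buffer."""
    return filename, fileobj.read()


def buffer_html(url, html_bytes):
    """Derive the filename from the URL and wrap the bytes in an in-memory buffer."""
    page = url.split('/')[-1]
    filename = '%s.html' % page
    # Constructing BytesIO with initial data leaves the read position at 0,
    # so no seek(0) is needed before the uploader reads it.
    buf = io.BytesIO(html_bytes)
    return filename, buf


name, buf = buffer_html('http://example.com/about', b'<html>hi</html>')
result = upload_stub(name, buf)
```

Nothing touches the disk here, so there is nothing to clean up afterwards. If the upload happens to use boto3, its client method `upload_fileobj` accepts a binary file-like object such as this buffer, but which S3 library is in use is an assumption on my part.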

Documentation: https://docs.python.org/3/library/io.html
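On the original NamedTemporaryFile question: with the default `delete=True`, the file is removed as soon as it is closed, which happens when the `with` block exits. A rough sketch, where `upload_stub` and `save_and_upload` are hypothetical names and the stub only reads the file object instead of talking to S3:

```python
import os
import tempfile


def upload_stub(fileobj):
    """Hypothetical stand-in for the real S3 upload: it only reads the file object."""
    return fileobj.read()


def save_and_upload(html_bytes):
    # delete=True is the default: the file is removed when it is closed,
    # i.e. when the with block exits (also if the upload raises).
    with tempfile.NamedTemporaryFile(suffix='.html') as tmp:
        tmp.write(html_bytes)
        tmp.seek(0)  # rewind so the uploader reads from the start
        path = tmp.name
        uploaded = upload_stub(tmp)
    return path, uploaded


path, uploaded = save_and_upload(b'<html>hi</html>')
exists_after = os.path.exists(path)
```

Note that this deletes the file whether or not the upload succeeded. If the file should survive a failed upload, one option is to pass `delete=False` and call `os.remove(path)` yourself only after the upload returns successfully.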

Upvotes: 3
