WebOrCode

Reputation: 7294

Scrapy, send scraped items in HTML email with custom formatting

Some background.
I am scraping a site that lists 25 ads per page.
If there are more than 25 ads, a Next button leads to the following page, and so on.
Each ad can be opened separately to see more information, but I do not do that because everything I need is on the page where the ads are listed together.
I am writing a program that collects all of yesterday's ads and sends them by email.
The basic idea is that a person does not have to check for new ads every day; the ads come to them by email.

The question is: how do I do this in Scrapy?

The scraping is done and works fine; the only thing left is to send the items by email with custom formatting.
By custom formatting I mean that every ad has a price, and I would like the ads ordered by price in the email.
I have a few ideas but do not know which is the right/best approach, so I would like some feedback.
Somebody who has already done this will know the pitfalls.

Possible solutions:
1. Export everything to a JSON file, then in another script load the JSON file with pandas, order by price and send the email.
- I do not like needing a second script and pandas just for ordering, but I also do not like the idea of writing my own ordering function, because I may want more custom formatting in the future.
2. Export to an in-memory SQLite database, maybe in a pipeline (but I currently do not know how, or whether it is even possible).
- This looks better to me, because I do not need another script and SQL has ordering, but I do not know where in Scrapy I can access all scraped items.
3. Some other idea?
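
To make option 2 concrete: the ordering itself needs nothing beyond the standard library. Below is a minimal sketch with sqlite3, assuming each ad is a dict with title, url and price keys. Those field names and the function name order_ads_by_price are made up for illustration and are not taken from my spider.

```python
import sqlite3


def order_ads_by_price(ads):
    """Load ad dicts into an in-memory SQLite table and return them sorted by price.

    The keys 'title', 'url' and 'price' are assumed field names.
    """
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # rows can then be converted to dicts
    conn.execute("CREATE TABLE ads (title TEXT, url TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO ads (title, url, price) VALUES (:title, :url, :price)",
        ads,
    )
    rows = conn.execute(
        "SELECT title, url, price FROM ads ORDER BY price"
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]
```

What this does not answer is where in Scrapy all items are available at once, which is the part I am unsure about.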

Upvotes: 0

Views: 560

Answers (2)

WebOrCode

Reputation: 7294

Here is the solution I ended up with; maybe it will be useful to somebody.
I used dataset for the database and yagmail to send the email from a Gmail account.
I think jinja was overkill :-)

import dataset
import yagmail
from jinja2 import Template


class SendEmailPipeline(object):
    def open_spider(self, spider):
        # in-memory SQLite database, gone when the process exits
        db = dataset.connect('sqlite:///:memory:')
        self.table = db['tmp_table']

    def process_item(self, item, spider):
        # need to use dict(item)
        self.table.insert(dict(item))
        # return the item so that later pipelines still receive it
        return item

    def close_spider(self, spider):
        spider.logger.info('SendEmailPipeline.close_spider()')
        # find() returns an iterator, so read the ordered rows into a list
        # once; the list gives both the number of ads and the template data
        rows = list(self.table.find(order_by=['price_per_m2', 'price_euro']))
        number_of_adds = len(rows)

        # jinja template
        template = Template(
        """
        <b>There are {{number_of_adds}} new ads:</b>
        {% for row in data %}
            {{row['price_per_m2']|round|int}} euro/m2 {{row['price_euro']|round|int}} euro {{row['area_in_m2']|round|int}}m2
            <a href="{{row['url']}}">{{row['title']}}</a>
        {% endfor %}
        """
        )

        email_body = template.render(data=rows, number_of_adds=number_of_adds)

        subject = str(number_of_adds) + ' new ads'

        # send email
        yag = yagmail.SMTP("GMAIL_EMAIL", "PASSWORD")
        yag.send(to='EMAIL_TO_SEND', subject=subject, contents=email_body)

Upvotes: 0

aufziehvogel

Reputation: 7297

Since you said that you start your crawler only once per day and want one email per day, you can use a custom pipeline to send the email and buffer the items in the pipeline until the spider is finished.

A pipeline receives the items from spiders in the method process_item, so you collect them there and store them in a list inside the pipeline (pipelines are loaded once and exist for the whole run of your spider, so it's OK to keep a buffer in a pipeline).

When a spider finishes, Scrapy calls the close_spider method of each pipeline, so that is where you can send your mail.
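
Stripped down to the buffering idea only, this could look roughly like the sketch below. It is not my production code: the class name BufferItemsPipeline, the helper build_body and the item fields title, url and price are assumptions for illustration, and the actual sending is left out.

```python
from html import escape


def build_body(ads):
    """Render already-sorted ads as a simple HTML fragment (hypothetical helper)."""
    lines = ["<b>There are {} new ads:</b>".format(len(ads))]
    for ad in ads:
        lines.append(
            '<p>{} euro <a href="{}">{}</a></p>'.format(
                ad["price"], escape(ad["url"]), escape(ad["title"])
            )
        )
    return "\n".join(lines)


class BufferItemsPipeline:
    """Collects every item during the crawl and processes them together at the end."""

    def open_spider(self, spider):
        # called once when the spider starts: create an empty buffer
        self.items = []

    def process_item(self, item, spider):
        # called once per scraped item: just remember it
        self.items.append(dict(item))
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # called once when the spider is done: all items are available here
        self.sorted_items = sorted(self.items, key=lambda ad: ad["price"])
        self.email_body = build_body(self.sorted_items)
        # sending self.email_body is omitted in this sketch
```

Because all items are in one list by the time close_spider runs, ordering by price is a single sorted() call and no extra script or database is required.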

This is the class I use for one of my systems (I simplified it and removed stuff like maximum buffer size, AWS related code, ...):

import smtplib
from email.mime.text import MIMEText

import jinja2


class SendMailPipeline(object):
    def __init__(self, server, sender_mail, jinja_env):
        self.server = server
        self.sender = sender_mail
        self.items_cache = []
        self.jinja = jinja_env

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings

        server = smtplib.SMTP(settings['MAIL_SERVER'])
        server.ehlo()
        server.starttls()
        server.ehlo()
        server.login(settings['MAIL_USER'], settings['MAIL_PASSWORD'])

        # I used S3 in my solution to store the mail templates, but
        # I removed all AWS related stuff from this example
        # for simplification
        # feel free to use a standard jinja2 loader
        # (from local file system)
        jinja_env = jinja2.Environment(
            loader=skyscraper.jinja2.loaders.S3Loader(
                s3,
                settings['MAIL_TEMPLATE_BUCKET'],
                settings['MAIL_TEMPLATE_PREFIX']
            ),
            autoescape=jinja2.select_autoescape(['html', 'xml'])
        )

        return cls(server, settings['MAIL_FROM'], jinja_env)

    def process_item(self, item, spider):
        self.items_cache.append(item)

        return item

    def close_spider(self, spider):
        # When the spider is finished, we need to flush the cache one last
        # time to make sure that all items are sent
        self._flush_cache()

    def _flush_cache(self):
        items = self.items_cache

        if len(items) > 0:
            self.items_cache = []  # reset cache to empty
            self._send_mail(items)

    def _send_mail(self, items):
        # mailing_options was originally given from a database
        # feel free to get it from any system you like
        # or set a hard-coded path

        template_path = mailing_options['TemplatePath']
        recipients = mailing_options['Recipients']
        subject = mailing_options['Subject']

        template = self.jinja.get_template(template_path)

        mail_content = template.render(items=items)
        msg = MIMEText(mail_content)
        msg['Subject'] = subject
        msg['From'] = self.sender
        msg['To'] = ', '.join(recipients)

        self.server.send_message(msg)

This example does not send HTML mails yet, but if you read up on sending mail with Python, adding HTML to the mail should be very simple.
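
As a rough idea of what the HTML variant could look like with the standard library: the sketch below builds a multipart/alternative message with a plain-text part and an HTML part. The function name build_html_message and its parameters are made up for this example; only MIMEMultipart and MIMEText come from Python's email package.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_html_message(subject, sender, recipients, html_body, text_body):
    """Build a mail with a plain-text fallback and an HTML version."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # Clients usually show the last part they can render, so the
    # plain-text part goes first and the HTML part last.
    msg.attach(MIMEText(text_body, "plain"))
    msg.attach(MIMEText(html_body, "html"))
    return msg
```

The returned message can be passed to send_message of an smtplib.SMTP connection in the same way as the MIMEText object in _send_mail.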

Upvotes: 3
