Cole Gorski

Reputation: 11

How to read CSV data stored in Google Cloud Storage with Cloud Functions

As part of a communications effort to a large user base, I need to send upwards of 75,000 emails per day. The emails of the users I'm contacting are stored in a CSV file. I've been using Postman Runner to send these requests via SendGrid (Email API), but with such a large volume, my computer either slows way down or Postman completely crashes before the batch completes. Even if it doesn't crash, it takes upwards of 3 hours to send this many POST requests via Runner.

I'd like to upload the CSV containing the emails into a Cloud Storage bucket and then access the file using Cloud Functions to send a POST request for each email. This way, all the processing can be handled by GCP and not by my personal machine. However, I can't seem to get the Cloud Function to read the CSV data line-by-line. I've tried using createReadStream() from the Cloud Storage NodeJS client library along with csv-parser, but can't get this solution to work. Below is what I tried:

const sendGridMail = require('@sendgrid/mail');
const { Storage } = require('@google-cloud/storage');
const fs = require('fs');
const csv = require('csv-parser');

exports.sendMailFromCSV = (file, context) => {

    console.log(`  Event: ${context.eventId}`);
    console.log(`  Event Type: ${context.eventType}`);
    console.log(`  Bucket: ${file.bucket}`);
    console.log(`  File: ${file.name}`);
    console.log(`  Metageneration: ${file.metageneration}`);
    console.log(`  Created: ${file.timeCreated}`);
    console.log(`  Updated: ${file.updated}`);

    const storage = new Storage();
    const bucket = storage.bucket(file.bucket);
    const remoteFile = bucket.file(file.name);
    console.log(remoteFile);

    let emails = [];
       
    fs.createReadStream(remoteFile)
        .pipe(csv())
        .on('data', function (row) {
            console.log(`Email read: ${row.email}`);
            emails.push(row.email);
            //send email using the SendGrid helper library
            const msg = {
                to: [{
                    "email": row.email
                }],
                from: "[email protected]",
                template_id: "fakeTemplate",
            };

            sendGridMail.send(msg).then(() =>
                context.status(200).send(file.body))
                .catch(function (err) {
                    console.log(err);
                    context.status(400).send(file.body);
                });
        })
        .on('end', function () {
            console.table(emails);
        });    
};

The Cloud Function is currently triggered by an upload to the Cloud Storage bucket.

Is there a way to build a solution to this problem without loading the file into memory? Is Cloud Functions the right path to be moving down, or would it be better to use App Engine or some other tool? I'm willing to try any GCP solution that moves this process to the cloud.

Upvotes: 1

Views: 2361

Answers (2)

al-dann

Reputation: 2725

A Cloud Function's memory can also be used as a temporary directory, /tmp. Thus, you can download the CSV file from the Cloud Storage bucket into that directory as a local file and then process it as if it were on a local drive.

At the same time, you may want to keep in mind two main restrictions:

  1. Memory - up to 2 GB for everything, including files written to /tmp
  2. Timeout - no more than 540 seconds per invocation.
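
Within those limits, a minimal sketch of the download-to-/tmp approach could look like the following (it reuses the @google-cloud/storage and csv-parser packages from the question; the cleanup at the end is my own addition, and the SendGrid call is omitted):

const { Storage } = require('@google-cloud/storage');
const csv = require('csv-parser');
const fs = require('fs');
const os = require('os');
const path = require('path');

exports.sendMailFromCSV = async (file, context) => {
    const storage = new Storage();
    const tempFilePath = path.join(os.tmpdir(), path.basename(file.name));

    // Copy the uploaded object from the bucket into the function's /tmp space.
    await storage.bucket(file.bucket).file(file.name).download({ destination: tempFilePath });

    // From here on the CSV behaves like any local file.
    await new Promise((resolve, reject) => {
        fs.createReadStream(tempFilePath)
            .pipe(csv())
            .on('data', (row) => console.log(`Email read: ${row.email}`))
            .on('error', reject)
            .on('end', resolve);
    });

    // /tmp is backed by the function's memory, so remove the file when done.
    fs.unlinkSync(tempFilePath);
};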

I personally would create a solution based on a combination of a few GCP resources.

The first cloud function is triggered by a 'finalize' event - when the CSV file is saved in the bucket. This cloud function reads the file and, for every record, composes a Pub/Sub message with relevant details (enough to send an email). That message is published to a Pub/Sub topic.
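
A hedged sketch of that first function, streaming the CSV straight from the bucket and publishing one message per row (the topic name email-jobs and the function name fanOutFromCSV are placeholders, not from the answer):

const { Storage } = require('@google-cloud/storage');
const { PubSub } = require('@google-cloud/pubsub');
const csv = require('csv-parser');

const storage = new Storage();
const topic = new PubSub().topic('email-jobs'); // placeholder topic name

exports.fanOutFromCSV = (file, context) => {
    const publishes = [];
    return new Promise((resolve, reject) => {
        storage.bucket(file.bucket).file(file.name)
            .createReadStream()                  // stream directly from GCS, no local copy
            .pipe(csv())
            .on('data', (row) => {
                // One Pub/Sub message per CSV record, carrying just enough to send the email.
                publishes.push(topic.publishMessage({ json: { email: row.email } }));
            })
            .on('error', reject)
            .on('end', () => resolve(Promise.all(publishes)));
    });
};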

The Pub/Sub topic carries the messages from the first cloud function and triggers the second cloud function for each one.

The second cloud function is triggered by a Pub/Sub message, which contains all necessary details to process and send an email. As there may be 75K records in the source CSV file (for example), you should expect 75K invocations of the second cloud function.
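
A sketch of that second function, triggered by the topic above (the sender address and template ID are the placeholders from the question; the SENDGRID_API_KEY environment variable is an assumption):

const sendGridMail = require('@sendgrid/mail');
sendGridMail.setApiKey(process.env.SENDGRID_API_KEY); // assumed environment variable

exports.sendOneEmail = async (message, context) => {
    // Pub/Sub-triggered background functions receive the payload base64-encoded.
    const { email } = JSON.parse(Buffer.from(message.data, 'base64').toString());

    await sendGridMail.send({
        to: email,
        from: "[email protected]",      // placeholder from the question
        template_id: "fakeTemplate",    // placeholder from the question
    });
};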

That may be enough at a high level. Pub/Sub guarantees at-least-once delivery (a message may be delivered more than once), so if you need no more than one email per address, some additional resources may be required to achieve idempotent behaviour.
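
One possible way to get that idempotency (an illustration only, not part of the answer) is to record each Pub/Sub message ID in Firestore with create(), which fails if the document already exists, and skip duplicates:

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();

// Returns true if this message ID was already handled by a previous delivery.
async function alreadyProcessed(messageId) {
    try {
        // create() throws ALREADY_EXISTS (gRPC code 6) if the document is already there.
        await firestore.collection('processed-messages').doc(messageId).create({ at: new Date() });
        return false;
    } catch (err) {
        if (err.code === 6) return true;
        throw err;
    }
}

In the second function you could check alreadyProcessed(context.eventId) before calling SendGrid, since for Pub/Sub-triggered functions the event ID matches the message ID.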

Upvotes: 1

Chris32

Reputation: 4961

Basically, you will have to download the file to the Cloud Function's local filesystem to be able to read it this way.

Now there are multiple options to work around this.

The most basic/simplest is to provision a Compute Engine machine and run this operation from it, if it is a one-off event.

If you need to do this more frequently (e.g. daily), you can use an online tool to convert your CSV file into JSON and import it into Firestore; then you can read the emails from Firestore much faster.
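
As a rough illustration of that last step (the emails collection and email field names are assumptions, not from the answer), reading the addresses back out of Firestore could look like:

const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();

// Fetch every document in the assumed 'emails' collection and pull out its address field.
async function listEmails() {
    const snapshot = await firestore.collection('emails').get();
    return snapshot.docs.map((doc) => doc.get('email'));
}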

Upvotes: 0
