user2302244

Reputation: 935

Reading large files from cloud storage and writing to datastore

I have this Cloud Function which is triggered from a bucket in Cloud Storage. It reads the file, converts each line to an RDF triple using N3, and then writes the resulting triple to Datastore.

Since it downloads the whole file into memory, it is not suitable for large files. How should this function be changed so that it processes the file one line at a time?

const storage = require('@google-cloud/storage')();
const Datastore = require('@google-cloud/datastore');
const N3 = require('n3');

exports.helloGCS = (event, callback) => {
    const file = event.data;

    if (file.resourceState === 'not_exists') {
      console.log(`File ${file.name} deleted.`);
      callback(null, 'ok');
    } else if (file.metageneration === '1') {
      // metageneration attribute is updated on metadata changes.
      // on create value is 1
      console.log(`File ${file.name} uploaded.`);
      let parser = N3.Parser();
      const bucket = storage.bucket('woburn-advisory-ttl');
      const remoteFile = bucket.file(file.name);
      const datastore = new Datastore({});
      let number_of_rows = 0;
      remoteFile.download()
          .then(data => {   // convert buffer to string
              if (data) {
                  const lines = data.toString().split('\n');
                  console.log(lines.length);
                  let entities = lines.map(line => {
                      let triple = parser.parse(line)[0];
                      if (triple) {
//                          console.log(triple)
                          const tripleKey = datastore.key('triple');
                          let entity = {
                              key: tripleKey,
                              data: [
                                  {
                                      name: 'subject',
                                      value: triple.subject
                                  },
                                  {
                                      name: 'predicate',
                                      value: triple.predicate
                                  },
                                  {
                                      name: 'object',
                                      value: triple.object
                                  }
                              ]
                          }
                          return entity;
                      } else {
                          return false;
                      }
                  });
                  entities = entities.filter(entity => entity);
                  console.log(entities.length);
                  // res is not defined in a background Cloud Function, so report
                  // completion through the callback once the save has finished.
                  return datastore.save(entities)
                      .then(() => {
                          console.log(`${entities.length} triples created successfully.`);
                          callback(null, 'ok');
                      });
              }
              callback(null, 'ok');
          })
    } else {
        console.log(`File ${file.name} metadata updated.`);
        callback(null, 'ok');
    }
};

Upvotes: 1

Views: 2487

Answers (1)

David

Reputation: 9721

Instead of calling download(), use createReadStream(). This lets you work through the whole file without holding it in memory. You can use something like byline or readline to grab individual lines from that stream.

Overall this will look something like:

const byline = require('byline');

const gcsStream = remoteFile.createReadStream();
const lineStream = byline.createStream(gcsStream);
lineStream.on('data', function(line) {
    let triple = parser.parse(line.toString())[0];
    //...
});
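Put together, the whole function could look roughly like the sketch below. It is only a sketch, not a drop-in implementation: it keeps the same bucket name, entity shape, and synchronous parser.parse() call from the question, assumes the byline package has been added to package.json, and the BATCH_SIZE constant is an illustrative choice that matches Datastore's 500-entities-per-commit limit.

// Streaming version (sketch): read the object line by line and write
// Datastore entities in batches instead of building one big array.
const storage = require('@google-cloud/storage')();
const Datastore = require('@google-cloud/datastore');
const N3 = require('n3');
const byline = require('byline');

exports.helloGCS = (event, callback) => {
    const file = event.data;
    // The delete / metadata-update branches are collapsed here for brevity.
    if (file.resourceState === 'not_exists' || file.metageneration !== '1') {
        callback(null, 'ok');
        return;
    }

    const parser = N3.Parser();
    const datastore = new Datastore({});
    const remoteFile = storage.bucket('woburn-advisory-ttl').file(file.name);

    let batch = [];          // entities waiting to be written
    const saves = [];        // pending datastore.save() promises
    const BATCH_SIZE = 500;  // Datastore allows at most 500 entities per commit

    const lineStream = byline.createStream(remoteFile.createReadStream());

    lineStream.on('data', (line) => {
        const triple = parser.parse(line.toString())[0];
        if (!triple) return;  // skip blank or unparseable lines
        batch.push({
            key: datastore.key('triple'),
            data: [
                { name: 'subject', value: triple.subject },
                { name: 'predicate', value: triple.predicate },
                { name: 'object', value: triple.object }
            ]
        });
        if (batch.length >= BATCH_SIZE) {
            saves.push(datastore.save(batch));
            batch = [];
        }
    });

    lineStream.on('end', () => {
        if (batch.length) {
            saves.push(datastore.save(batch));  // flush the last partial batch
        }
        Promise.all(saves)
            .then(() => callback(null, 'ok'))
            .catch(callback);
    });

    lineStream.on('error', callback);
};

Collecting the save() promises and waiting on them in the 'end' handler keeps the function from signalling completion before all the writes have actually finished.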

Upvotes: 3
