Reputation: 389
I have a 1GB text file in blob storage, and I'm currently building a function that will take the contents of each line and send them to an external API. This external API is limited to 200 requests per second, and I'm also restricted to a ten-minute runtime for my functions because of my subscription plan.
I'm looking into using durable functions to handle this use case by reading the file in chunks, and I have written the following code to test out the chunked reading.
module.exports = async function (context, myTimer) {
    context.log("Trigger fired");

    if (myTimer.isPastDue) {
        context.log('JavaScript is running late!');
    }

    // getContainerClient is a project helper that returns a ContainerClient
    // for the given connection string and container name.
    const containerClient = getContainerClient(process.env.AzureWebJobsStorage, 'location');

    for await (const blob of containerClient.listBlobsFlat()) {
        if (blob.name !== 'test.txt') {
            continue;
        }

        const blobClient = containerClient.getBlobClient(blob.name);
        // Download only the first 1 MB of the blob.
        const downloadBlockBlobResponse = await blobClient.download(0, 1024 * 1024);

        try {
            const blobContent = (
                await streamToBuffer(downloadBlockBlobResponse.readableStreamBody)
            ).toString();
            context.log(blobContent);
        } catch (error) {
            context.log(`ERROR: issues reading the following file - ${blob.name}, due to the following error: ${error.message}`);
        }
    }

    context.log("Trigger completed");
};

// Collect a readable stream into a single Buffer.
async function streamToBuffer(readableStream) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        readableStream.on("data", (data) => {
            chunks.push(data instanceof Buffer ? data : Buffer.from(data));
        });
        readableStream.on("end", () => {
            resolve(Buffer.concat(chunks));
        });
        readableStream.on("error", reject);
    });
}
However, when I read a single MB of the text file, the chunk ends in the middle of a line rather than at the end of one, meaning I cannot send the final line to the API.
Does anyone have any ideas on how I can guarantee that a chunk of data will always contain full lines? Or is there a better way to handle this use case in Azure?
The contents of the file look something like this:
Test|TestTwo|[email protected]|USA|New York|1234|main|street|12347|711|1973-09-09
Upvotes: 0
Views: 351
Reputation: 707806
Unless you know something in advance about the length of each line (for example, that every line is exactly 128 bytes long), there is no way to always read to a perfect line boundary.
Instead, you have to keep reading small amounts until you reach the end of a line, then record in your temporary storage the file position where the next pass should resume, so that it picks up at the start of the next line.
For example, if a typical line is 100 bytes long and you end up with a partial line (which you almost always will), read another 250 bytes or so until you find the end of the line you're currently on. Then calculate the file position where that line ended and store it for the next pass.
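As a rough illustration, here is a minimal sketch of that approach using the @azure/storage-blob client and the streamToBuffer helper from the question. The loadOffset, saveOffset, and sendToApi functions are hypothetical placeholders for however you persist the read position between runs (a small state blob, a table entity, a durable entity, etc.) and call your rate-limited API; the sketch also assumes single-byte text and lines shorter than the chunk size.

    const { BlobServiceClient } = require("@azure/storage-blob");

    const CHUNK_SIZE = 1024 * 1024; // read 1 MB per pass (hypothetical value)

    async function processNextChunk(context) {
        const service = BlobServiceClient.fromConnectionString(process.env.AzureWebJobsStorage);
        const blobClient = service.getContainerClient("location").getBlobClient("test.txt");

        const { contentLength } = await blobClient.getProperties();
        const start = await loadOffset();              // hypothetical: position saved by the previous run
        if (start >= contentLength) {
            return;                                    // whole file already processed
        }

        const count = Math.min(CHUNK_SIZE, contentLength - start);
        const response = await blobClient.download(start, count);
        const chunk = (await streamToBuffer(response.readableStreamBody)).toString();

        // Cut the chunk back to the last complete line; the partial tail line
        // will be re-read at the start of the next pass.
        const reachedEnd = start + count >= contentLength;
        const lastNewline = chunk.lastIndexOf("\n");
        const usable = reachedEnd ? chunk : chunk.slice(0, lastNewline + 1);

        for (const line of usable.split("\n").filter(Boolean)) {
            await sendToApi(line);                     // hypothetical: your rate-limited API call
        }

        // Resume just after the last complete line on the next invocation.
        const nextOffset = reachedEnd ? contentLength : start + Buffer.byteLength(usable);
        await saveOffset(nextOffset);                  // hypothetical: persist the offset
    }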
Upvotes: 1