user1513388

Reputation: 7459

Reading a large file and using the splitBy method

I'm trying to use the splitBy method in highland.js to extract the data between the begin and end delimiters.

        -----BEGIN DATA-----
        MIIEzDCCArSgAwIBAgIVCugKYzMN5ra8zPWxYE8pUU9SxjYSMA0GCSqGSIb3DQEB
        CwUAMHAxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV
        BAcMB1dhcndpY2sxEDAOBgNVBAoMB0VudHJ1c3QxETAPBgNVBAsMCFBLSSBURUFN
        -----END DATA-----
        -----BEGIN DATA-----
        MIIETzCCAjegAwIBAgIVBShP2Mx74DZEyNKwYZZPGntRmSWnMA0GCSqGSIb3DQEB
        DQUAMHIxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV
        BAcMB1dhcndpY2sxDDAKBgNVBAoMA0lCTTERMA8GA1UECwwIUEtJIFRFQU0xGTAX
        5/62
        -----END DATA-----

I can read the file into a stream like this:

        const readFile = _.wrapCallback(fs.readFile);
        stream = _(files).map(readFile).parallel(2);

        const blob = _(stream).splitBy('-----BEGIN DATA-----')

However, I can't seem to work out how to process the file and extract the data I need.

Upvotes: 1

Views: 94

Answers (1)

Tad Lispy

Reputation: 3056

There are really three concerns here.

  1. Reading the contents of the files
  2. Extracting the delimited chunks
  3. Getting the resulting data out of the stream

First, you need to read the contents of each file. Note that the wrapped readFile emits Buffers, not Strings. To extract the chunks, you need to convert the contents of each file to a String. I assume the files are encoded as UTF-8.
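To illustrate the Buffer versus String point without Highland, here is a minimal sketch using only Node's fs module. The temporary file name and its contents are made up for the example.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical sample file, written to the temp directory so the sketch is self-contained.
const samplePath = path.join(os.tmpdir(), "highland-sample.txt");
fs.writeFileSync(samplePath, "-----BEGIN DATA-----\nabc\n-----END DATA-----\n");

// Without an encoding argument, readFileSync (like readFile) yields a Buffer.
const raw = fs.readFileSync(samplePath);
console.log(Buffer.isBuffer(raw)); // true

// Decode explicitly, assuming the file is UTF-8.
const text = raw.toString("utf-8");
console.log(typeof text); // string
```

Calling String methods such as match directly on the Buffer would fail, which is why the decoding step has to come first.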

Second, you need to separate the data from the rest of the text. I assume that you only want the chunks between the begin and end delimiters, without the delimiters themselves or anything outside them. For example:

-----BEGIN DATA-----
MIIEzDCCArSgAwIBAgIVCugKYzMN5ra8zPWxYE8pUU9SxjYSMA0GCSqGSIb3DQEB
CwUAMHAxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV
BAcMB1dhcndpY2sxEDAOBgNVBAoMB0VudHJ1c3QxETAPBgNVBAsMCFBLSSBURUFN
-----END DATA-----
junky junk junk
-----BEGIN DATA-----
MIIETzCCAjegAwIBAgIVBShP2Mx74DZEyNKwYZZPGntRmSWnMA0GCSqGSIb3DQEB
DQUAMHIxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV
BAcMB1dhcndpY2sxDDAKBgNVBAoMA0lCTTERMA8GA1UECwwIUEtJIFRFQU0xGTAX
5/62
-----END DATA-----

should result in:

[ '\nMIIEzDCCArSgAwIBAgIVCugKYzMN5ra8zPWxYE8pUU9SxjYSMA0GCSqGSIb3DQEB\nCwUAMHAxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV\nBAcMB1dhcndpY2sxEDAOBgNVBAoMB0VudHJ1c3QxETAPBgNVBAsMCFBLSSBURUFN\n'
, '\nMIIETzCCAjegAwIBAgIVBShP2Mx74DZEyNKwYZZPGntRmSWnMA0GCSqGSIb3DQEB\nDQUAMHIxCzAJBgNVBAYTAkdCMRUwEwYDVQQIDAxXYXJ3aWNrc2hpcmUxEDAOBgNV\nBAcMB1dhcndpY2sxDDAKBgNVBAoMA0lCTTERMA8GA1UECwwIUEtJIFRFQU0xGTAX\n5/62\n'
]

To get this result, I use a regular expression with two non-capturing groups for the delimiters and a capturing group for the data. First I extract the delimited chunks, then I remove the delimiters from each one. This may not be very efficient, but it should do the job.
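The extraction step can be tried on its own, outside of any stream. The sketch below is a slight variation rather than the exact two-step approach described above: it uses String.prototype.matchAll (available in Node 12 and later) to read the capturing group directly. The helper name extractChunks and the sample text are made up for illustration.

```javascript
// One capturing group for the data; the delimiters are matched but not captured.
// [\s\S] matches any character, including line breaks.
const pattern = /-----BEGIN DATA-----([\s\S]+?)-----END DATA-----/g;

// extractChunks is a hypothetical helper name, not part of Highland.
function extractChunks(content) {
  // matchAll yields one match per delimited block; match[1] is the captured data.
  return Array.from(content.matchAll(pattern), (match) => match[1]);
}

const sample = [
  "-----BEGIN DATA-----",
  "AAAA",
  "-----END DATA-----",
  "junky junk junk",
  "-----BEGIN DATA-----",
  "BBBB",
  "CCCC",
  "-----END DATA-----",
].join("\n");

console.log(extractChunks(sample)); // [ '\nAAAA\n', '\nBBBB\nCCCC\n' ]
```

A nice side effect of matchAll is that text without any delimiters simply gives an empty array.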

Note that the callback passed to flatMap returns an array of strings. Using map here would result in a stream of arrays, one for every file. What we want is a single stream of strings, which is why flatMap is used here.
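The difference is easiest to see with plain arrays. This is only an analogy: Array.prototype.map and Array.prototype.flatMap stand in for the stream methods, and the inputs are made up.

```javascript
// Plain-array analogy only; these are Array methods, not Highland stream methods.
const contents = ["a1 a2", "b1"]; // pretend each string is the content of one file
const splitWords = (content) => content.split(" ");

// map keeps one array per input item.
const nested = contents.map(splitWords);
console.log(nested); // [ [ 'a1', 'a2' ], [ 'b1' ] ]

// flatMap merges the per-item arrays into a single flat sequence.
const flat = contents.flatMap(splitWords);
console.log(flat); // [ 'a1', 'a2', 'b1' ]
```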

Finally, you need to get the stream flowing and get the data out of it. To do this, call a consuming method on the stream. In this example I use toArray. The callback passed to it is called once with an array containing all elements of the stream, in this case all of your data chunks.

Here is the whole thing:

const Stream = require("highland")
const FS = require("fs")

const files = [ "./input-1.txt", "./input-2.txt" ]
const readFile = Stream.wrapCallback(FS.readFile)

// Non-capturing groups match the delimiters; the capturing group ($1) holds the data.
// [\s\S] matches any character, including \r and \n, so CRLF files work too.
const pattern = /(?:-----BEGIN DATA-----)([\s\S]+?)(?:-----END DATA-----)/g

Stream(files)
  // 1. Read contents
  .map(readFile)
  .parallel(2)
  .invoke("toString", ["utf-8"])
  // 2. Process contents to extract data
  .flatMap((content) =>
    // get an array of chunks (including delimiters);
    // match returns null when nothing matches, so fall back to an empty array
    (content.match(pattern) || [])
      // remove the delimiters from each chunk, leaving only the data
      .map((chunk) => chunk.replace(pattern, "$1")))
  // 3. Get the resulting data out of the stream
  .toArray((chunks) =>
    console.log(chunks) // will print an array of data chunks
  )
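If the files are too large to hold in memory comfortably, a line-by-line approach might be worth considering instead of the regular expression. The following is only a sketch of that different technique, not something Highland provides: collectChunks is a hypothetical helper that accepts any iterable of lines (how the lines are produced is left out), and it assumes each delimiter sits on its own line. Unlike the regex version, it trims each line and returns chunks without the surrounding newlines.

```javascript
// collectChunks is a hypothetical helper; it accepts any iterable of lines.
function collectChunks(lines) {
  const chunks = [];
  let current = null; // null means "currently outside a delimited block"
  for (const line of lines) {
    const trimmed = line.trim();
    if (trimmed === "-----BEGIN DATA-----") {
      current = []; // start collecting a new block
    } else if (trimmed === "-----END DATA-----") {
      if (current !== null) chunks.push(current.join("\n"));
      current = null; // anything until the next BEGIN is ignored
    } else if (current !== null) {
      current.push(trimmed);
    }
  }
  return chunks;
}

const lines = [
  "-----BEGIN DATA-----", "AAAA", "-----END DATA-----",
  "junky junk junk",
  "-----BEGIN DATA-----", "BBBB", "CCCC", "-----END DATA-----",
];
console.log(collectChunks(lines)); // [ 'AAAA', 'BBBB\nCCCC' ]
```

Only the block currently being collected is kept in memory, plus the finished chunks.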

Upvotes: 1
