Reputation: 3652
i'm somewhat new to NodeJS streams, and the more i learn about them, the more i believe they're not a particularly simple and stable thing. i'm attempting to read big files with csv / csv-parse (apparently the most popular CSV module for NodeJS) using the piping API, which involves using stream-transform by the same author.
part of what i'm experiencing here is reproducible without actually using the parser, so i commented those parts out to make the example simpler (for those who prefer JavaScript over CoffeeScript, there's also a JS version):
#-------------------------------------------------------------------------------
fs               = require 'fs'
transform_stream = require 'stream-transform'
log              = console.log
as_transformer   = ( method ) -> transform_stream method, parallel: 11
# _new_csv_parser  = require 'csv-parse'
# new_csv_parser   = -> _new_csv_parser delimiter: ','

#-------------------------------------------------------------------------------
$count = ( input_stream, title ) ->
  count = 0
  #.............................................................................
  input_stream.on 'end', ->
    log ( title ? 'Count' ) + ':', count
  #.............................................................................
  return as_transformer ( record, handler ) =>
    count += 1
    handler null, record

#-------------------------------------------------------------------------------
read_trips = ( route, handler ) ->
  # parser = new_csv_parser()
  input    = fs.createReadStream route
  #.............................................................................
  input.on 'end', ->
    log 'ok: trips'
    return handler null
  input.setMaxListeners 100 # <<<<<<
  #.............................................................................
  # input.pipe parser
  input.pipe $count input, 'trips A'
    .pipe $count input, 'trips B'
    .pipe $count input, 'trips C'
    .pipe $count input, 'trips D'
    # ... and so on ...
    .pipe $count input, 'trips Z'
  #.............................................................................
  return null

route = '/Volumes/Storage/cnd/node_modules/timetable-data/germany-berlin-2014/trips.txt'

read_trips route, ( error ) ->
  throw error if error?
  log 'ok'
the input file contains 204865 lines of GTFS data; i'm not parsing it here, just reading it raw, so i guess what i'm counting with the above code is chunks of data.
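to make that visible, something like the following probe could be piped in; this is just a sketch (not part of my original code), probe is an illustrative name, and 64 KiB is only the default highWaterMark of fs read streams, so actual chunk sizes may differ:
#-------------------------------------------------------------------------------
# sketch only: a probe transformer to check that each `record` is a raw Buffer
# chunk (up to the read stream's highWaterMark, 64 KiB by default for fs
# streams), not a CSV line
probe = as_transformer ( chunk, handler ) ->
  log 'is buffer:', ( Buffer.isBuffer chunk ), 'length:', chunk.length
  handler null, chunk
# usage: `input.pipe probe` before (or instead of) the first $count
if that reading is right, the 157 reported below is roughly what one would expect: 157 chunks at up to 64 KiB each works out to about 10 MB, a plausible size for a file with ~205k lines of trip data.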
i'm piping the stream from counter to counter and would expect to hit the last counter as often as the first one; however, this is what i get:
trips A: 157
trips B: 157
trips C: 157
...
trips U: 157
trips V: 144
trips W: 112
trips X: 80
trips Y: 48
trips Z: 16
in an earlier setup where i actually did parse the data, i got this:
trips A: 204865
trips B: 204865
trips C: 204865
...
trips T: 204865
trips U: 180224
trips V: 147456
trips W: 114688
trips X: 81920
trips Y: 49152
trips Z: 16384
so it would appear that the stream somehow runs dry along the way.
my suspicion was that the end event of the input stream is not a reliable signal to listen to when trying to decide whether all processing has finished; after all, it is logical to assume that processing can only complete some time after the stream has been fully consumed.
so i looked for another event to listen to (didn't find one) and tried to delay calling the callback (with setTimeout, process.nextTick and setImmediate), but to no avail.
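for reference, the delaying attempts looked roughly like this; a sketch of the idea, not the exact code i ran:
#-------------------------------------------------------------------------------
# sketch: deferring the callback from the 'end' handler in read_trips; none of
# these variants changed the counts
input.on 'end', ->
  log 'ok: trips'
  setImmediate -> handler null
  # process.nextTick -> handler null
  # setTimeout ( -> handler null ), 0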
it would be great if someone could point out how relevant setTimeout, process.nextTick and setImmediate are in this context.
Update: i now believe the problem lies with stream-transform, which has an open issue where someone reported a very similar problem with practically identical figures (he has 234841 records and ends up with 16390, i have 204865 and end up with 16384). not a proof, but too close to be accidental.
i ditched stream-transform and use event-stream.map instead; the test then runs OK.
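the change is small because event-stream's map callback takes the same (data, callback) arguments as the stream-transform handler above; a sketch of the swap (the rest of the code stays unchanged):
#-------------------------------------------------------------------------------
es             = require 'event-stream'
# as_transformer = ( method ) -> transform_stream method, parallel: 11
as_transformer = ( method ) -> es.map method
with that, $count and the pipe chain need no changes.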
Upvotes: 3
Views: 4082
Reputation: 3652
some days later i think i can say that stream-transform has a problem with big files.
i've since switched to event-stream, which is IMHO a better solution overall as it is completely generic (i.e. it's about streams in general, not about CSV-data-as-streams in particular). i've outlined some thoughts about stream libraries in NodeJS in the documentation for my incipient pipdreams module, which provides a number of commonly used stream operations.
Upvotes: 2