Reputation: 6003
I need to read some data from a file in chuck of 128M, and then for each line, I will do some processing, naive way to do is using split to convert the string into collection of lines and then process each line, but maybe that is not effective as it will create a collection which simply stores the temp result which could be costy. Is there is a way with better performance?
The file is huge, so I kicked off several thread, each thread will pick up 128 chuck, in the following script rawString is a chuck of 128M.
randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines=rawString.split("\n")
for(line <- lines){
...
}
Upvotes: 0
Views: 99
Reputation: 167871
I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n
.
Anyway, assuming that byteBuffer
is actually an array of bytes and not a java.nio.ByteBuffer
, and that you're okay with just handling Unix line encodings, you would want to
def lines(bs: Array[Byte]): Array[String] = {
val xs = Array.newBuilder[Int]
var i = 0
while (i<bs.length) {
if (bs(i)=='\n') xs += i
i += 1
}
val ix = xs.result
val ss = new Array[String](0 max (ix.length-1))
i = 1
while (i < ix.length) {
ss(i-1) = new String(bs, ix(i-1)+1, ix(i)-ix(i-1)-1)
i += 1
}
ss
}
Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
Upvotes: 1
Reputation: 29438
It'd be better to read text line by line:
import scala.io.Source
for(line <- Source.fromFile("file.txt").getLines()) {
...
}
Upvotes: 2