Daniel Wu

Reputation: 6003

Processing a bunch of strings efficiently

I need to read data from a file in chunks of 128 MB and then process each line. The naive way is to use split to convert the string into a collection of lines and then process each line, but that may be inefficient, since it creates a collection that only holds temporary results, which could be costly. Is there a way to do this with better performance?

The file is huge, so I start several threads, and each thread picks up one 128 MB chunk. In the following script, rawString is one such chunk.

randomAccessFile.seek(start)
randomAccessFile.read(byteBuffer)
val rawString = new String(byteBuffer)
val lines = rawString.split("\n")
for (line <- lines) {
    ...
}

Upvotes: 0

Views: 99

Answers (2)

Rex Kerr

Reputation: 167871

I'm not sure what you're going to do with the trailing bits of lines at the beginning and end of the chunk. I'll leave that to you to figure out--this solution captures everything delimited on both sides by \n.

Anyway, assuming that byteBuffer is actually an array of bytes and not a java.nio.ByteBuffer, and that you're okay with just handling Unix line endings, you would want something like this:

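def lines(bs: Array[Byte]): Array[String] = {
  // Pass 1: record the index of every '\n' in the chunk.
  val xs = Array.newBuilder[Int]
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n') xs += i
    i += 1
  }
  val ix = xs.result()
  // Pass 2: build one String per span between consecutive newlines
  // (decoded with the platform default charset).
  val ss = new Array[String](0 max (ix.length - 1))
  i = 1
  while (i < ix.length) {
    ss(i - 1) = new String(bs, ix(i - 1) + 1, ix(i) - ix(i - 1) - 1)
    i += 1
  }
  ss
}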

Of course this is rather long and messy code, but if you're really worried about performance this sort of thing (heavy use of low-level operations on primitives) is the way to go. (This also takes only ~3x the memory of the chunk on disk instead of ~5x (for mostly/entirely ASCII data) since you don't need the full string representation around.)
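
If the goal is to avoid the intermediate array altogether, the same scan can hand each line to a callback as it is found. This is only a minimal sketch of that variation: foreachLine is a hypothetical name, and it assumes the same conditions as above (an Array[Byte], Unix line endings) plus UTF-8 text. Like lines, it skips the partial lines before the first and after the last \n.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper: calls f on every line that is delimited by '\n'
// on both sides, without building an Array[String] first.
def foreachLine(bs: Array[Byte])(f: String => Unit): Unit = {
  var prev = -1 // index of the previous '\n', or -1 if none seen yet
  var i = 0
  while (i < bs.length) {
    if (bs(i) == '\n'.toByte) {
      // Emit only when there is a '\n' on the left as well.
      if (prev >= 0) f(new String(bs, prev + 1, i - prev - 1, UTF_8))
      prev = i
    }
    i += 1
  }
}
```

Usage would look like foreachLine(bytes)(line => process(line)), where process stands in for whatever per-line work is needed. Each String still gets allocated, but only one needs to be live at a time.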

Upvotes: 1

ntalbs

Reputation: 29438

It'd be better to read text line by line:

import scala.io.Source
for(line <- Source.fromFile("file.txt").getLines()) {
  ...
}
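
Note that Source.fromFile keeps the file open until the source is closed. Here is a sketch of the same loop with the source closed afterwards and an explicit encoding. processFile and handle are hypothetical names, and UTF-8 is an assumption about the file:

```scala
import scala.io.Source

// Hypothetical wrapper: streams the file one line at a time and
// closes it even if handle throws.
def processFile(path: String)(handle: String => Unit): Unit = {
  val source = Source.fromFile(path, "UTF-8")
  try source.getLines().foreach(handle)
  finally source.close()
}
```

getLines() returns an iterator, so only the current line has to be held in memory rather than the whole file.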

Upvotes: 2
