alifirat

Reputation: 2937

Efficient way to read binary files in Scala

I'm trying to read a 16 MB binary file that contains only 16-bit integers. To do that, I read the file in chunks of 1 MB, which gives me an array of bytes. For my own needs, I convert each byte array to a short array with the function convert below. But reading the file through a buffer and converting it to a short array takes 5 seconds. Is there a faster way than my solution?

def convert(in: Array[Byte]): Array[Short] = in.grouped(2).map {
  // odd trailing byte: treat it as the high byte, low byte zero
  case Array(one) => (one << 8).toShort
  // mask the low byte so sign extension doesn't clobber the high byte
  case Array(hi, lo) => ((hi << 8) | (lo & 0xFF)).toShort
}.toArray

import java.io.RandomAccessFile

val startTime = System.nanoTime()

val file = new RandomAccessFile("foo", "r")
val defaultBlockSize = 1 * 1024 * 1024
val byteBuffer = new Array[Byte](defaultBlockSize)
// note: any final partial chunk (file.length % defaultBlockSize bytes) is ignored
val chunkNums = (file.length / defaultBlockSize).toInt
for (i <- 1 to chunkNums) {
  val seek = (i - 1).toLong * defaultBlockSize
  file.seek(seek)
  file.read(byteBuffer)
  val s = convert(byteBuffer)
  println(byteBuffer.length)
}

val stopTime = System.nanoTime()
println("Perf = " + ((stopTime - startTime) / 1000000000.0) + " s")

Upvotes: 0

Views: 2163

Answers (2)

Rex Kerr

Reputation: 167901

16 MB easily fits in memory unless you're running this on a feature phone or something. No need to chunk it and make the logic harder.

Just gulp the whole file at once with java.nio.file.Files.readAllBytes:

val buffer = java.nio.file.Files.readAllBytes(myfile.toPath)

assuming you are not stuck with Java 1.6. (If you are stuck with Java 1.6, pre-allocate a buffer of myfile.length bytes and use read on a FileInputStream to fill it in one go. It's not much harder; just don't forget to close the stream!)
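The pre-Java-7 fallback described above can be sketched like this. It is a sketch, not part of the original answer: readAllBytes16 is a name I've made up, and the loop is there because a single read call may return fewer bytes than requested:

```scala
import java.io.{File, FileInputStream}

// Sketch of the Java 1.6 fallback: pre-allocate the buffer from the
// file length, then fill it from a FileInputStream in one pass.
def readAllBytes16(myfile: File): Array[Byte] = {
  val buffer = new Array[Byte](myfile.length.toInt)
  val in = new FileInputStream(myfile)
  try {
    var off = 0
    while (off < buffer.length) {
      // read may return fewer bytes than asked for, so loop until full
      val n = in.read(buffer, off, buffer.length - off)
      if (n < 0) throw new java.io.EOFException("file shrank while reading")
      off += n
    }
  } finally in.close() // don't forget to close it!
  buffer
}
```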

Then if you don't want to convert it yourself, you can

val bb = java.nio.ByteBuffer.wrap(buffer)
bb.order(java.nio.ByteOrder.nativeOrder)  // or BIG_ENDIAN to match the hi << 8 | lo layout
val shorts = new Array[Short](buffer.length/2)
bb.asShortBuffer.get(shorts)              // bulk copy, no boxing

And you're done.

Note that this is all Java stuff; there's nothing Scala-specific here save the syntax.

If you're wondering why this is so much faster than your code, it's because grouped(2) boxes each byte and places the pair in a fresh array. That's three allocations for every short you want! You can do the conversion yourself by indexing the array directly, and that will be fast, but why would you want to when ByteBuffer and friends do exactly what you need already?
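The "index the array directly" option mentioned above can be sketched as follows (convertFast is a name I've made up; big-endian, matching the question's hi << 8 | lo layout, and a trailing odd byte is ignored):

```scala
// Allocation-free conversion: a while loop over the byte array,
// writing straight into a pre-sized short array. No grouped, no boxing.
def convertFast(in: Array[Byte]): Array[Short] = {
  val out = new Array[Short](in.length / 2)
  var i = 0
  while (i < out.length) {
    // mask the low byte so sign extension doesn't clobber the high byte
    out(i) = ((in(2 * i) << 8) | (in(2 * i + 1) & 0xFF)).toShort
    i += 1
  }
  out
}
```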

If you really, really care about that last (odd) byte, size shorts as (buffer.length + 1)/2, bulk-copy only buffer.length/2 shorts, and then tack on if ((buffer.length & 1) == 1) shorts(shorts.length - 1) = ((buffer(buffer.length - 1) & 0xFF) << 8).toShort to fill in the last byte. (Note the parentheses: in Scala, == binds tighter than &.)
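Putting the whole ByteBuffer route together, odd byte included, might look like this (a sketch; toShorts is a name I've made up, and I've used BIG_ENDIAN to match the question's hi << 8 | lo layout rather than nativeOrder):

```scala
// Wrap the byte array, bulk-copy every complete pair through a
// ShortBuffer view, then pad the final short by hand if the length is odd.
def toShorts(buffer: Array[Byte]): Array[Short] = {
  val bb = java.nio.ByteBuffer.wrap(buffer)
  bb.order(java.nio.ByteOrder.BIG_ENDIAN)
  val shorts = new Array[Short]((buffer.length + 1) / 2)
  bb.asShortBuffer.get(shorts, 0, buffer.length / 2)
  if ((buffer.length & 1) == 1)
    shorts(shorts.length - 1) = ((buffer(buffer.length - 1) & 0xFF) << 8).toShort
  shorts
}
```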

Upvotes: 3

A couple of issues pop out:

If byteBuffer always has size 1024*1024, then the case Array(one) branch in convert will never be hit, so the pattern match is unnecessary.
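With even-sized chunks guaranteed, convert could drop the match and just index each pair directly (a sketch; convertNoMatch is a name I've made up, same big-endian layout as the question):

```scala
// No pattern match: every group from grouped(2) has exactly two bytes,
// so index them directly. Low byte is masked to avoid sign extension.
def convertNoMatch(in: Array[Byte]): Array[Short] =
  in.grouped(2).map(p => ((p(0) << 8) | (p(1) & 0xFF)).toShort).toArray
```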

Also, you can avoid the for loop with a tail-recursive function. After the val byteBuffer = ... line, you can replace chunkNums and the for loop with:

@scala.annotation.tailrec
def readAndConvert(b: List[Array[Short]], file: RandomAccessFile): List[Array[Short]] = {
  if (file.read(byteBuffer) < 0)
    b
  else {
    // read already advances the file pointer, so no seek or skipBytes is needed
    readAndConvert(convert(byteBuffer) +: b, file)
  }
}

val sValues = readAndConvert(List.empty[Array[Short]], file)

Note: because prepending to a list is much faster than appending, the loop above collects the converted chunks in the reverse of their order in the file.
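If you need the chunks back in file order, a single reverse at the end restores it. A minimal illustration (the sample values are made up):

```scala
// Prepending builds the list newest-chunk-first; one O(n) reverse at the
// end is cheaper than paying O(n) per append while reading.
val collectedNewestFirst = List(Array[Short](3), Array[Short](2), Array[Short](1))
val inFileOrder = collectedNewestFirst.reverse
```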

Upvotes: 0
