Arik

Reputation: 115

Reading binary file in Spark Scala

I need to extract data from a binary file.

I used binaryRecords and got an RDD[Array[Byte]].
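For reference, this is roughly how I read it (the path and the 14-byte record length are placeholders matching my Int + Short + Long layout):

import org.apache.spark.rdd.RDD

// record length = 4 (Int) + 2 (Short) + 8 (Long) = 14 bytes
val records: RDD[Array[Byte]] = sc.binaryRecords("/tmp/data.bin", 14)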

From here I want to parse every record into a case class (Field1: Int, Field2: Short, Field3: Long).

How can I do this?

Upvotes: 5

Views: 8534

Answers (2)

Naveen Nelamali

Reputation: 1164

Since Spark 3.0, Spark has a "binaryFile" data source for reading binary files.

I found this at How to read Binary file into DataFrame, which has more explanation.

val df = spark.read.format("binaryFile").load("/tmp/binary/spark.png")
df.printSchema()
df.show()

This outputs the schema and DataFrame as below:

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- content: binary (nullable = true)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/C:/tmp/bina...|2020-07-25 10:11:...| 74675|[89 50 4E 47 0D 0...|
+--------------------+--------------------+------+--------------------+
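If you only need certain files from a directory, the binaryFile source also supports a pathGlobFilter option. A minimal sketch (the directory and pattern are placeholders):

val pngDf = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png")
  .load("/tmp/binary/")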

Thanks

Upvotes: 1

GameOfThrows

Reputation: 4510

Assuming you have no delimiter: an Int in Scala is 4 bytes, a Short is 2 bytes, and a Long is 8 bytes. Assume your binary data is structured (for each record) as Int, Short, Long. You should be able to take the bytes and convert them to the types you want.

import java.nio.ByteBuffer

// Each record: 4-byte Int, 2-byte Short, 8-byte Long (14 bytes, big-endian)
val result = YourRDD.map(x => (
  ByteBuffer.wrap(x.take(4)).getInt,           // bytes 0-3
  ByteBuffer.wrap(x.drop(4).take(2)).getShort, // bytes 4-5
  ByteBuffer.wrap(x.drop(6)).getLong))         // bytes 6-13

This uses Java's ByteBuffer to convert the bytes to Int/Short/Long; you can use other libraries if you prefer.
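If you want the case class from the question rather than a tuple, here is a sketch under the same layout assumptions (field names adapted from the question; the Record name is mine). Note that ByteBuffer is big-endian by default; call order(ByteOrder.LITTLE_ENDIAN) on it if your file was written little-endian:

import java.nio.ByteBuffer

case class Record(field1: Int, field2: Short, field3: Long)

val parsed = YourRDD.map { bytes =>
  val buf = ByteBuffer.wrap(bytes) // each get* call advances the buffer position
  Record(buf.getInt, buf.getShort, buf.getLong)
}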

Upvotes: 4
