How to process a nested Key Value Pair in Spark / Scala data import

Question

I am new to Spark and Scala, so please forgive the noobness. What I have is a text file which is in this format:

328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]

I have been able to create the RDD using the sc.textFile command, and I can process each section using this command:

val department_record = department_rdd.map(record => record.split(";"))

As you can see, though, the 3rd element is a nested key / value pair, and so far, I have been unable to work with it. What I am looking for is a way to transform the data from the above to an RDD that looks like this:

|ID |NAME        |STREET         |CITY   |STATE|

|328|ADMIN HEARNG|939 W El Camino|Chicago|IL   |

Any help is appreciated.

Leo C · Accepted Answer

You can split the address field at "," into an Array, strip the enclosing brackets from its elements, and then split each element at "#" to extract the address components you want, as shown below:

val department_rdd = sc.parallelize(Seq(
  "328;ADMIN HEARNG;[street#939 W El Camino,city#Chicago,state#IL]",
  "400;ADMIN HEARNG;[street#800 First Street,city#San Francisco,state#CA]"
))

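val department_record = department_rdd.
  map(_.split(";")).
  map{ case Array(id, name, address) =>
    // Split "[k1#v1,k2#v2,k3#v3]" at ",", strip the enclosing brackets,
    // then split each "key#value" pair at "#"
    val addressArr = address.split(",").
      map(_.replaceAll("^\\[|\\]$", "").split("#"))
    // Keep only the values, in the order street, city, state
    (id, name, addressArr(0)(1), addressArr(1)(1), addressArr(2)(1))
  }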

department_record.collect
// res1: Array[(String, String, String, String, String)] = Array(
//   (328,ADMIN HEARNG,939 W El Camino,Chicago,IL),
//   (400,ADMIN HEARNG,800 First Street,San Francisco,CA)
// )
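
The version above relies on the keys always appearing in the order street, city, state, and it throws a MatchError on any line that does not have exactly three ";"-separated fields. If that is a concern, a possible variation is to look the values up by key and skip malformed lines. The following is only a sketch in plain Scala (no Spark needed to try it); the names Department and parseLine are made up for illustration and are not part of the answer above:

```scala
// Hypothetical names (Department, parseLine), introduced only for this sketch.
final case class Department(id: String, name: String,
                            street: String, city: String, state: String)

// Returns None for lines that do not match the expected layout.
def parseLine(line: String): Option[Department] =
  line.split(";", 3) match {
    case Array(id, name, address) =>
      // Drop the enclosing brackets, then build a key -> value map
      // from the comma-separated "key#value" pairs.
      val fields: Map[String, String] = address
        .stripPrefix("[").stripSuffix("]")
        .split(",")
        .map(_.split("#", 2))
        .collect { case Array(k, v) => k.trim -> v.trim }
        .toMap
      // Look values up by key, so their order in the input does not matter.
      for {
        street <- fields.get("street")
        city   <- fields.get("city")
        state  <- fields.get("state")
      } yield Department(id, name, street, city, state)
    case _ => None // fewer than three ';'-separated fields
  }
```

Assuming the same department_rdd as above, something like department_rdd.flatMap(line => parseLine(line)) should then give an RDD of Department records with the unparseable lines dropped. Note that this sketch still assumes the values themselves contain no "," characters.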

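In case you want to convert to a DataFrame, simply apply toDF() (in spark-shell the required implicits are already in scope; in a standalone application, import spark.implicits._ first):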

department_record.toDF("id", "name", "street", "city", "state").show
// +---+------------+----------------+-------------+-----+
// | id|        name|          street|         city|state|
// +---+------------+----------------+-------------+-----+
// |328|ADMIN HEARNG| 939 W El Camino|      Chicago|   IL|
// |400|ADMIN HEARNG|800 First Street|San Francisco|   CA|
// +---+------------+----------------+-------------+-----+
