koiralo

Reputation: 23109

How to read a file with custom delimiters for new line and column in Spark (Scala)

What is the best way to read a text file that uses "^*~" as the new-line (record) delimiter and "^|&" as the column delimiter? I have a file with a large number of columns (more than 100), so please suggest an efficient approach. Below is a sample of the file with a few fields.

I have a file like

abcd^|&cdef^|&25^|&hile^|&12345^*~xyxxx^|&zzzzz^|&70^|&dharan^|&6567576

I want this file to be like

fname   lname   age address phone
abcd    cdef    25  hile    12345
xyxxx   zzzzz   70  dharan  6567576

Upvotes: 2

Views: 1786

Answers (1)

eliasah

Reputation: 40370

You'll need to flatMap and split on the escaped record delimiter to create lines, then split each line on the (also escaped) column delimiter with the same approach, and finally pattern match the resulting arrays into tuples:

val str = "abcd^|&cdef^|&25^|&hile^|&12345^*~xyxxx^|&zzzzz^|&70^|&dharan^|&6567576"
val rdd = sc.parallelize(Seq(str))

// Split on the escaped record delimiter "^*~" to get one element per row, then
// split each row on the escaped column delimiter "^|&" and pattern match the
// resulting array into a tuple.
val rdd2 = rdd.flatMap(_.split("\\^\\*~")).map(_.split("\\^\\|&") match {
  case Array(a, b, c, d, e) => (a, b, c, d, e)
})

// outside spark-shell, toDF requires: import spark.implicits._
rdd2.toDF("fname","lname","age","address","phone").show
// +-----+-----+---+-------+-------+                                               
// |fname|lname|age|address|  phone|
// +-----+-----+---+-------+-------+
// | abcd| cdef| 25|   hile|  12345|
// |xyxxx|zzzzz| 70| dharan|6567576|
// +-----+-----+---+-------+-------+
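
The snippet above parallelizes an in-memory string. To read the actual file, plain sc.textFile won't work, because it splits records on newlines only; a common workaround is to set textinputformat.record.delimiter in the Hadoop configuration and read through newAPIHadoopFile. A minimal sketch, assuming the hypothetical path /data/input.txt:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Treat "^*~" (rather than "\n") as the record separator.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "^*~")

// Each value is one record; the key (the byte offset) can be dropped.
val lines = sc.newAPIHadoopFile(
  "/data/input.txt",              // hypothetical path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf
).map { case (_, text) => text.toString }

Also note that Scala tuples stop at 22 elements, so the Array(a, b, c, d, e) => (a, b, c, d, e) pattern won't scale to the 100+ columns mentioned in the question. For wide records, one option is to build Rows against an explicit schema instead; a sketch with made-up column names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical column names; in practice list all 100+ of them (or load them from a header).
val columns = Seq("fname", "lname", "age", "address", "phone")
val schema = StructType(columns.map(StructField(_, StringType, nullable = true)))

// Keep every field as a string and let the schema name the columns.
val rows = lines.map(line => Row.fromSeq(line.split("\\^\\|&").toSeq))
spark.createDataFrame(rows, schema).show()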

Upvotes: 3
