user4549111

Reputation:

Sort using apache spark for text input

I want to sort the data using spark-shell (scala).

Input is like this (EDIT: the first column can start with two spaces)

AQWD  11BC23  A12A
ZXDM  33QWSD  CC12
  DM  EEZM33  FFZ2

I am trying to use sc.textFile("input.txt")

Now I want to sort the data using the first column only. I know I need to use sortByKey(), but which transformation or action should I apply first so that I can use sortByKey()? With the code below, which doesn't seem right, I am getting the error "sortByKey is not a member of org.apache.spark.rdd.RDD[Array[String]]":

val lines = sc.textFile("input.txt")
val sort = lines.map(_.split("  ")).sortByKey()

Expected output

  DM  EEZM33  FFZ2
AQWD  11BC23  A12A
ZXDM  33QWSD  CC12

Since a space has ASCII value 32, the line starting with spaces should sort to the top, followed by the remaining data.
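(For reference, this ordering can be checked with plain Scala string comparison, no Spark needed; since the first column is the line's prefix here, sorting whole lines gives the same order:)

```scala
// A space (ASCII 32) compares lower than any letter, so lines that
// begin with spaces sort before lines that begin with letters.
val rows = List(
  "AQWD  11BC23  A12A",
  "ZXDM  33QWSD  CC12",
  "  DM  EEZM33  FFZ2"
)
val sorted = rows.sorted
sorted.foreach(println) // the "  DM" line prints first
```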

Upvotes: 0

Views: 665

Answers (1)

Mateusz Dymczyk

Reputation: 15141

sortByKey() is one of the so-called OrderedRDDFunctions, which are only available on RDDs that contain (key, value) tuples. Your RDD contains Array[String]. If you want to do it your way, you can do it like this:

val lines = sc.textFile("input.txt")
val sort =  lines.map(_.split("  ")).map(arr => (arr(0),arr.mkString("  "))).sortByKey()

Edit: yes, you can combine those two maps into one step; I find this more readable :-)
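(A sketch of the combined single-map version, keeping the whole original line as the value so the array doesn't have to be re-joined; the pairing-and-sorting logic can be checked on a plain Scala collection without a cluster:)

```scala
// Local sketch of the (key, value) pairing that sortByKey() needs.
// In Spark the same logic would be:
//   lines.map(line => (line.split("  ")(0), line)).sortByKey()
// Note: for a line starting with "  ", split("  ")(0) is "", which
// sorts first, matching the expected output.
val lines = List(
  "AQWD  11BC23  A12A",
  "ZXDM  33QWSD  CC12",
  "  DM  EEZM33  FFZ2"
)
val pairs  = lines.map(line => (line.split("  ")(0), line)) // key = first column
val sorted = pairs.sortBy(_._1)                             // Spark: sortByKey()
sorted.foreach { case (_, line) => println(line) }
```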

You can also do it like this:

scala> lines.sortBy[String]( (line:String) => line.split("  ")(0), true, 1 ).foreach(println)
  DM  EEZM33  FFZ2
AQWD  11BC23  A12A
ZXDM  33QWSD  CC12

@Edit: if your key is different, you just need to include it in your logic. For instance, if all your delimiters are double spaces and you want the first two columns as the key, you can change the above code to:

lines.map(_.split("  ")).map(arr => (arr(0) + "  " + arr(1),arr.mkString("  ")))

Or the second one:

lines.sortBy[String]( (line:String) => { val split = line.split("  "); split(0) + "  " + split(1) }, true, 1 )
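(The composite-key logic from the two snippets above can likewise be checked locally; the split/re-join behaves identically inside a Spark map or sortBy:)

```scala
// Local check of the composite key: the first two "  "-separated
// tokens re-joined with "  ".
val line  = "AQWD  11BC23  A12A"
val split = line.split("  ")
val key   = split(0) + "  " + split(1)
println(key) // prints "AQWD  11BC23"
```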

Upvotes: 0
