user4549111

Reputation:

Sort using apache spark for text input

I want to sort the data using spark-shell (scala).

Input is like this (EDIT: the first column can start with two spaces)

AQWD  11BC23  A12A
ZXDM  33QWSD  CC12
  DM  EEZM33  FFZ2

I am trying to use sc.textFile("input.txt")

Now I want to sort the data using the first column only. I know I need to use sortByKey(), but which transformation or action should I apply first so that I can use sortByKey()? With the code below, which doesn't seem right, I am getting the error "sortByKey is not a member of org.apache.spark.rdd.RDD[Array[String]]":

val lines = sc.textFile("input.txt")
val sort = lines.map(_.split("  ")).sortByKey()

Expected output

  DM  EEZM33  FFZ2
AQWD  11BC23  A12A
ZXDM  33QWSD  CC12

Since a space has ASCII value 32, the line starting with spaces should sort to the top, followed by the remaining data.
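(For reference, this ordering can be checked with plain Scala string comparison, no Spark needed; since the first column is the line's prefix here, sorting whole lines gives the same order:)

```scala
// A space (ASCII 32) compares lower than any letter, so lines that
// begin with spaces sort before lines that begin with letters.
val rows = List(
  "AQWD  11BC23  A12A",
  "ZXDM  33QWSD  CC12",
  "  DM  EEZM33  FFZ2"
)
val sorted = rows.sorted
sorted.foreach(println) // the "  DM" line prints first
```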

Upvotes: 0

Views: 665

Answers (1)

Mateusz Dymczyk

Reputation: 15141

sortByKey() is one of the so-called OrderedRDDFunctions, which are only available on RDDs that contain (key, value) tuples. Your RDD contains Array[String]. If you want to do it your way, you can do it like this:

val lines = sc.textFile("input.txt")
val sort =  lines.map(_.split("  ")).map(arr => (arr(0),arr.mkString("  "))).sortByKey()

Edit: yes, you can combine those two maps into one step; I find this more readable :-)
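(A sketch of the combined single-map version, keeping the whole original line as the value so the array doesn't have to be re-joined; the pairing-and-sorting logic can be checked on a plain Scala collection without a cluster:)

```scala
// Local sketch of the (key, value) pairing that sortByKey() needs.
// In Spark the same logic would be:
//   lines.map(line => (line.split("  ")(0), line)).sortByKey()
// Note: for a line starting with "  ", split("  ")(0) is "", which
// sorts first, matching the expected output.
val lines = List(
  "AQWD  11BC23  A12A",
  "ZXDM  33QWSD  CC12",
  "  DM  EEZM33  FFZ2"
)
val pairs  = lines.map(line => (line.split("  ")(0), line)) // key = first column
val sorted = pairs.sortBy(_._1)                             // Spark: sortByKey()
sorted.foreach { case (_, line) => println(line) }
```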

You can also do it like this:

scala> lines.sortBy[String]( (line:String) => line.split("  ")(0), true, 1 ).foreach(println)
  DM  EEZM33  FFZ2
AQWD  11BC23  A12A
ZXDM  33QWSD  CC12

@Edit: if your key is different, you just need to include it in your logic. For instance, if all your delimiters are double spaces and you want the first two columns as the key, you can change the above code to:

lines.map(_.split("  ")).map(arr => (arr(0) + "  " + arr(1),arr.mkString("  ")))

Or the second one:

lines.sortBy[String]( (line:String) => { val split = line.split("  "); split(0) + "  " + split(1) }, true, 1 )
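(The composite-key logic from the two snippets above can likewise be checked locally; the split/re-join behaves identically inside a Spark map or sortBy:)

```scala
// Local check of the composite key: the first two "  "-separated
// tokens re-joined with "  ".
val line  = "AQWD  11BC23  A12A"
val split = line.split("  ")
val key   = split(0) + "  " + split(1)
println(key) // prints "AQWD  11BC23"
```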

Upvotes: 0
