Reputation:
I want to sort the data using spark-shell (Scala). The input looks like this (EDIT: the first column can contain two spaces):
AQWD 11BC23 A12A
ZXDM 33QWSD CC12
BCDM EEZM33 FFZ2
I am reading it with sc.textFile("input.txt"). Now I want to sort the data using the first column only. I know I need to use sortByKey(), but which transformation or action should I apply first so that I can use sortByKey()? With the code below, which doesn't seem right, I am getting the error that sortByKey is not a member of rdd.RDD[Array[String]]:
val lines = sc.textFile("input.txt")
val sort = lines.map(_.split(" ")).sortByKey()
Expected output:
DM 33QWSD CC12
AQWD 11BC23 A12A
BCDM EEZM33 FFZ2
Since a space has an ASCII value of 32, that row will be at the top, and then the remaining data in sorted order.
Upvotes: 0
Views: 665
Reputation: 15141
sortByKey() is one of the so-called OrderedRDDFunctions. They are only available on RDDs that contain tuples (key, value). Your RDD contains Array[String]. If you want to do it your way, you can do it like this:
val lines = sc.textFile("input.txt")
val sort = lines.map(_.split(" ")).map(arr => (arr(0), arr.mkString(" "))).sortByKey()
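If you want the sorted lines back without the key afterwards, you can drop it with the standard values method on pair RDDs (a minimal usage sketch, my addition):
// keep only the original lines, now ordered by the first column
sort.values.collect().foreach(println)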
Edit: yes, you can do those two maps in one step; I find this more readable :-)
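For completeness, a sketch of what the fused single-map version could look like (my illustration, not the author's exact code):
// build the (key, line) pair in a single map
val sorted = lines.map(line => (line.split(" ")(0), line)).sortByKey()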
You can also do it like this:
scala> lines.sortBy[String]( (line:String) => line.split(" ")(0), true, 1 ).foreach(println)
AQWD 11BC23 A12A
BCDM EEZM33 FFZ2
ZXDM 33QWSD CC12
@Edit: if your key is different, you just need to include it in your logic. For instance, if all your delimiters are double spaces, you can change the above code to:
lines.map(_.split(" ")).map(arr => (arr(0) + " " + arr(1), arr.mkString(" ")))
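(To see why this works: a hypothetical row "BC DM  EEZM33  FFZ2" split on a single space gives Array("BC", "DM", "", "EEZM33", "", "FFZ2"), so arr(0) + " " + arr(1) reconstructs the full first column "BC DM" as the key.)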
Or the second one:
lines.sortBy[String]( (line:String) => { val split = line.split(" "); split(0) + " " + split(1) }, true, 1 )
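If the double space really is the column delimiter, an alternative sketch (my suggestion, untested against your data) is to split on it directly so that the whole first column becomes the key in one step:
// split on the two-space delimiter; element 0 is the complete first column
lines.sortBy[String]( (line: String) => line.split("  ")(0), true, 1 )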
Upvotes: 0