KDC

Reputation: 1471

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing sorting in the Spark shell. I have an RDD with about 10 columns/variables, and I want to sort the whole RDD on the values of column 7.

rdd
org.apache.spark.rdd.RDD[Array[String]] = ...

From what I gather, the way to do that is with sortByKey, which only works on pairs. So I mapped the RDD to pairs, each consisting of column 7 (a string value) and the full original row (an array of strings):

val rdd2 = rdd.map(c => (c(7), c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

I then apply sortByKey, still no problem...

val rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...

But now how do I split off, collect, and save the sorted original rows (Array[String]) from rdd3? Whenever I try a split on rdd3, it gives me an error:

val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])

What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?

Upvotes: 4

Views: 11136

Answers (4)

Zahiro Mor

Reputation: 1718

What you did with rdd2 = rdd.map(c => (c(7),c)) is map each row to a tuple, which is exactly what the type rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] says :). If you now want to split the record, you first need to get it out of this tuple. You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)

But I would strongly suggest trying rdd.sortBy(_(7)) or something of this sort. That way you do not need to bother with tuples at all.
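To compare the two routes side by side, here is a minimal sketch that uses a plain Scala Seq as a stand-in for the RDD. The sample rows and the names rows, viaPairs, direct and sameOrder are made up for illustration. On a real RDD the middle step of the first route would be sortByKey(), while map and sortBy are called the same way.

```scala
// Hypothetical sample rows; a local Seq stands in for the RDD here.
val rows: Seq[Array[String]] = Seq(
  Array("r1", "banana"),
  Array("r2", "apple"),
  Array("r3", "cherry")
)

// Route 1: pair each row with its key, sort on the key, then drop the key.
// On an RDD the sorting step would be sortByKey().
val viaPairs: Seq[Array[String]] =
  rows.map(c => (c(1), c)).sortBy(_._1).map(_._2)

// Route 2: sort directly on the column, no tuple needed.
val direct: Seq[Array[String]] = rows.sortBy(_(1))

// Arrays compare by reference, so convert to Seq before comparing contents.
val sameOrder: Boolean = viaPairs.map(_.toSeq) == direct.map(_.toSeq)
```

Both routes leave the rows in the same order; the direct sortBy simply skips building and unpacking the tuple.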

Upvotes: 2

jtitusj

Reputation: 3086

If you want to sort the RDD using the 7th string in the array, you can do it directly with

rdd.sortBy(_(6)) // array starts at 0 not 1

or

rdd.sortBy(arr => arr(6))

That will save you the hassle of doing multiple transformations. The reason rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is that this is not how you access an element inside an Array; _7 is a tuple accessor. To access the 7th element of an array, say arr, you should use arr(6).

To test this, I did the following:

val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))

// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))

Here's the result (after collect()):

Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
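The difference between tuple accessors and array indexing mentioned above can be seen in a small plain-Scala sketch. The values and the names asTuple, asArray, thirdOfTuple and thirdOfArray are made up for illustration, and no Spark is needed to try it.

```scala
// A tuple and an array holding the same three made-up values.
val asTuple: (String, String, String) = ("ard", "bas", "wer")
val asArray: Array[String] = Array("ard", "bas", "wer")

// Tuple fields use 1-based accessors: _1, _2, _3.
val thirdOfTuple: String = asTuple._3

// Array elements use 0-based indexing: arr(0), arr(1), arr(2).
val thirdOfArray: String = asArray(2)

// asArray._3 would not compile: Array has no member named _3.
```

The same rule applies inside sortBy: the function receives one Array[String] per row, so it has to index with parentheses.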

Upvotes: 2

Hlib

Reputation: 3064

Just do this:

val rdd4 = rdd3.map(_._2)
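To relate this to the failing split in the question, here is a minimal sketch on a plain Scala Seq shaped like the elements of rdd3. The sample data and the names sortedPairs, rowsOnly, thirdColumn and lines are made up for illustration. Each element is a (key, row) tuple, so there is nothing to split: the row is already an Array[String] and can be indexed directly.

```scala
// Hypothetical sorted pairs, shaped like the elements of rdd3: (key, row).
val sortedPairs: Seq[(String, Array[String])] = Seq(
  ("apple",  Array("r2", "apple",  "x")),
  ("banana", Array("r1", "banana", "y"))
)

// Drop the key to recover the original rows, as rdd3.map(_._2) does.
val rowsOnly: Seq[Array[String]] = sortedPairs.map(_._2)

// Pick one column from each row: index into the array, no split needed.
val thirdColumn: Seq[String] = sortedPairs.map { case (_, row) => row(2) }

// Join each row back into one comma-separated line, e.g. before saving as text.
val lines: Seq[String] = rowsOnly.map(_.mkString(","))
```

On an actual RDD the same map calls apply, and an RDD of strings like lines could then be written out with saveAsTextFile and a path of your choosing.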

Upvotes: 1

Peerapat A

Reputation: 430

I think you may not be familiar with Scala, so the following should help you understand more:

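// map is lazy and would print nothing by itself; take(5) brings a few pairs to the driver
rdd3.take(5).foreach { kv =>
  println(kv._1)                // the key: a String (the sort column)
  println(kv._2.mkString(","))  // the value: the original row, an Array[String]
}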

Upvotes: 0
