Nikhil Shekhar

Reputation: 3

How to parse a file and reformat it into a pair RDD in Spark using Scala

I have a dataset in a file in the following form:

1: 1664968

2: 3 747213 1664968 1691047 4095634 5535664

3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698 

4: 145

5: 8 57544 58089 60048 65880 284186 313376 

6: 8

I need to transform this into something like the following, using Spark and Scala, as part of data preprocessing:

1 1664968

2 3

2 747213

2 1664968

2 1691047

2 4095634

2 5535664

3 9

3 77935

3 79583

3 84707

And so on.

Can anyone suggest how this can be done? The length of the rows in the original file varies, as shown in the example above.

I am not sure how to go about this transformation.

I tried something like the code below, which gives me a pair of the key and the first element after the colon.

But I am not sure how to iterate over all the values in each row and generate the pairs as needed.

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("Graphx").setMaster("local"))
  val rawLinks = sc.textFile("src/main/resources/links-simple-sorted-top100.txt")

  rawLinks.take(5).foreach(println)

  // Emits only one pair per row: the key and the first value after the colon.
  val formattedLinks = rawLinks.map{ rows =>
    val fields = rows.split(":")
    val fromVertex = fields(0)
    // fields(1) starts with a space, so index 0 of this array is an empty string.
    val toVerticesArray = fields(1).split(" ")
    (fromVertex, toVerticesArray(1))
  }

  val topFive = formattedLinks.take(5)
  topFive.foreach(println)
}

Upvotes: 0

Views: 131

Answers (2)

Justin Pihony

Reputation: 67115

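// Two sample lines stand in for the input file.
val rdd = sc.parallelize(List("1: 1664968","2: 3 747213 1664968 1691047 4095634 5535664"))
val keyValues = rdd.flatMap(line => {
  // Split at the first colon only: key on the left, all values on the right.
  val Array(key, values) = line.split(":",2)
  // Emit one (key, value) pair per whitespace-separated value.
  for(value <- values.trim.split("""\s+"""))
    yield (key, value.trim)
})
keyValues.collect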
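
The snippet above keeps keys and values as strings and throws a MatchError on a line that has no colon (a blank line, for example). If numeric ids are wanted, a more defensive variant might look like the sketch below. It runs on a plain Scala collection so it can be tried without a cluster. `parseLine` is a hypothetical helper name, and the conversion to `Long` is an assumption based on the sample data, not something the question states.

```scala
// Hypothetical helper: turn one "key: v1 v2 ..." line into (key, value) pairs.
// Assumes every id is a whole number that fits in a Long; a non-numeric
// token would throw NumberFormatException.
def parseLine(line: String): Seq[(Long, Long)] =
  line.split(":", 2) match {
    // A colon was found and the key is non-empty: emit one pair per value.
    case Array(key, values) if key.trim.nonEmpty =>
      values.trim.split("""\s+""").toSeq
        .filter(_.nonEmpty)                 // a row with no values yields no pairs
        .map(v => (key.trim.toLong, v.toLong))
    // No colon (for example a blank line): emit nothing instead of failing.
    case _ => Seq.empty
  }

val sample = Seq("1: 1664968", "", "4: 145", "6: 8 ")
val pairs = sample.flatMap(parseLine)
```

The same function should be usable on an RDD of lines via `flatMap(parseLine)`, since `flatMap` there also expects a function that returns a collection per input element.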

Upvotes: 1

Shyamendra Solanki

Reputation: 8851

Split each row into two parts at the colon, then map over the variable number of values.

def transform(s: String): Array[String] = { 
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}

> transform("2: 3 747213 1664968 1691047 4095634 5535664")
// Array(2 3, 2 747213, 2 1664968, 2 1691047, 2 4095634, 2 5535664)
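
`transform` handles a single line, so it still has to be applied to every line and the resulting arrays flattened into one sequence. A minimal sketch of that step, using an ordinary Scala `Seq` in place of the RDD (the sample `lines` are taken from the question; `out` is just an illustrative name):

```scala
// Same function as above, repeated so the snippet is self-contained.
def transform(s: String): Array[String] = {
  val Array(head, tail) = s.split(":", 2)
  tail.trim.split("""\s+""").map(x => s"$head $x")
}

val lines = Seq("1: 1664968", "2: 3 747213")

// flatMap applies transform to each line and concatenates the results,
// giving one "key value" string per value in the input.
val out = lines.flatMap(line => transform(line))
out.foreach(println)
```

On the question's `rawLinks` RDD, the equivalent call would presumably be `rawLinks.flatMap(line => transform(line))`, which gives an RDD of strings rather than of tuples.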

Upvotes: 0
