Bhushan

Reputation: 1

Transformation of data into list of objects of class in spark scala

I am trying to write a Spark transformation that converts the data below into a list of objects of the following class. I am totally new to Scala and Spark; I tried splitting the data and putting it into a case class, but I was unable to combine the rows back together. I'd appreciate your help with this.

Data :

FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2

Desired output:

PlayerStats {
    String FirstName,
    String LastName,
    String Country,
    Map<String, Int> matchandscore
}

Upvotes: 0

Views: 1031

Answers (3)

Tzach Zohar

Reputation: 37822

Assuming you've already loaded the data into an RDD[String] named data:

import org.apache.spark.rdd.RDD

case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

val result: RDD[PlayerStats] = data
  .filter(!_.startsWith("FirstName")) // drop the header row
  .map(_.split(","))                  // split each line on commas
  .map { case Array(fn, ln, cntry, mn, g) =>
    PlayerStats(fn, ln, cntry, Map(mn -> g.toInt)) // one single-entry map per row
  }
  .keyBy(p => (p.FirstName, p.LastName)) // key by player
  .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore)) // merge the maps
  .map(_._2) // drop the key

Upvotes: 1

akuiper

Reputation: 214927

You could try something like the following:

val file = sc.textFile("myfile.csv")

// toDF needs the SQL implicits in scope (imported automatically in spark-shell):
// import spark.implicits._
val df = file.map(line => line.split(",")).                       // split each line on commas
              filter(lineSplit => lineSplit(0) != "FirstName").   // drop the header row
              map(lineSplit =>                                    // build one tuple per line
                (lineSplit(0), lineSplit(1), lineSplit(2), Map(lineSplit(3) -> lineSplit(4).toInt))).
              toDF("FirstName", "LastName", "Country", "MatchAndScore")

df.schema
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))

df.show

+---------+--------+---------+----------------+
|FirstName|LastName|  Country|   MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match1 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match2 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match3 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+

Upvotes: 0

Akash Sethi

Reputation: 2294

First convert each line into a key-value pair, say (Cristiano, rest of the data), then apply groupByKey (reduceByKey also works). Finally, convert the resulting key-value pairs into your class by filling in the values. The well-known word-count program is a good reference:

http://spark.apache.org/examples.html
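The key-by-player-then-merge pattern described above can be sketched with plain Scala collections (no Spark needed), since reduceByKey behaves like a groupBy followed by a per-group reduce. This is an illustrative sketch only; in a real job the same logic runs on an RDD, and the field names here are just example choices:

```scala
// Sketch: key by player, then merge the per-match maps for each key.
// Spark's reduceByKey applies the same merge function per key, in parallel.
case class PlayerStats(firstName: String, lastName: String,
                       country: String, matchAndScore: Map[String, Int])

val lines = Seq(
  "Cristiano,Ronaldo,Portugal,Match1,1",
  "Cristiano,Ronaldo,Portugal,Match2,1",
  "Lionel,Messi,Argentina,Match1,1"
)

val stats: List[PlayerStats] = lines
  .map(_.split(","))
  .map { case Array(fn, ln, c, m, g) =>
    PlayerStats(fn, ln, c, Map(m -> g.toInt))       // one-entry map per row
  }
  .groupBy(p => (p.firstName, p.lastName))          // like keyBy in Spark
  .values
  .map(_.reduce((a, b) =>                           // like reduceByKey: merge maps
    a.copy(matchAndScore = a.matchAndScore ++ b.matchAndScore)))
  .toList
```

Translating this to Spark means replacing groupBy/values/reduce with keyBy and reduceByKey on the RDD, as in the first answer.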

Upvotes: 0
