Reputation: 1
I am trying to write a Spark transformation that converts the data below into a list of objects of the class shown under "Desired output". I am new to Scala and Spark. I tried splitting each line and putting the fields into a case class, but I could not merge the rows for the same player back into a single object. Any help would be appreciated.
Data:
FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2
Desired output:
PlayerStats {
  String FirstName,
  String LastName,
  String Country,
  Map<String, Int> matchandscore
}
Upvotes: 0
Views: 1031
Reputation: 37822
Assuming you have already loaded the data into an RDD[String] named data:
import org.apache.spark.rdd.RDD

case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

val result: RDD[PlayerStats] = data
  .filter(!_.startsWith("FirstName")) // remove header
  .map(_.split(","))
  .map { // map into case classes
    case Array(fn, ln, cntry, mn, g) => PlayerStats(fn, ln, cntry, Map(mn -> g.toInt))
  }
  .keyBy(p => (p.FirstName, p.LastName)) // key by player
  .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore)) // merge the per-match maps
  .map(_._2) // remove key
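To see what the function passed to reduceByKey does, here is a minimal local sketch that runs without Spark. It merges two records for the same player. The name mergeStats is introduced here only for illustration; it is not part of Spark.

```scala
// Local sketch (no Spark needed): the merge step that reduceByKey applies pairwise.
case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

// Hypothetical helper name; same logic as the lambda given to reduceByKey above.
def mergeStats(p1: PlayerStats, p2: PlayerStats): PlayerStats =
  p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore)

val m1 = PlayerStats("Lionel", "Messi", "Argentina", Map("Match1" -> 1))
val m2 = PlayerStats("Lionel", "Messi", "Argentina", Map("Match2" -> 2))

// Both single-entry maps end up in one record for the player.
val merged = mergeStats(m1, m2)
```

Note that ++ keeps the right-hand value when both maps contain the same match name, so a duplicated match row would overwrite the earlier score rather than add to it.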
Upvotes: 1
Reputation: 214927
You could try something as follows:
// toDF needs the SQL implicits in scope (already imported in spark-shell)
val file = sc.textFile("myfile.csv")
val df = file.map(line => line.split(",")).             // split line by comma
  filter(lineSplit => lineSplit(0) != "FirstName").     // filter out the header row
  map(lineSplit =>                                      // one single-entry map per input line
    (lineSplit(0), lineSplit(1), lineSplit(2), Map(lineSplit(3) -> lineSplit(4).toInt))).
  toDF("FirstName", "LastName", "Country", "MatchAndScore")
df.schema
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))
df.show
+---------+--------+---------+----------------+
|FirstName|LastName|  Country|   MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match1 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match2 -> 2)|
|   Lionel|   Messi|Argentina|Map(Match3 -> 1)|
|   Lionel|   Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+
Upvotes: 0
Reputation: 2294
First convert each line into a key-value pair, for example (Cristiano, rest of the data), then apply groupByKey; reduceByKey can also work. After grouping or reducing, convert the resulting key-value pairs into your class by filling in its fields from the values. The well-known word count example is a useful reference:
http://spark.apache.org/examples.html
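The steps above can be tried out on a plain Scala collection before moving to Spark. This is only a sketch under assumptions: groupBy on a Seq stands in for groupByKey on an RDD, the sample lines are a subset of the question's data, and the key uses first name, last name and country together rather than the first name alone.

```scala
// Local sketch (no Spark): groupBy on a Seq stands in for groupByKey on an RDD.
case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])

// A subset of the question's data, header included.
val lines = Seq(
  "FirstName,LastName,Country,match,Goals",
  "Cristiano,Ronaldo,Portugal,Match1,1",
  "Cristiano,Ronaldo,Portugal,Match2,1",
  "Lionel,Messi,Argentina,Match1,1",
  "Lionel,Messi,Argentina,Match2,2"
)

// Step 1: drop the header and turn each line into a (key, value) pair.
val pairs = lines
  .filterNot(_.startsWith("FirstName"))
  .map(_.split(","))
  .collect { case Array(fn, ln, country, m, goals) => ((fn, ln, country), (m, goals.toInt)) }

// Step 2: group by key. Step 3: build one PlayerStats per player from the grouped values.
val stats = pairs
  .groupBy(_._1)
  .map { case ((fn, ln, country), rows) => PlayerStats(fn, ln, country, rows.map(_._2).toMap) }
  .toList
```

On a pair RDD the grouping step would be groupByKey(), which yields each key with an Iterable of its values, followed by the same kind of map. For large inputs reduceByKey is usually the better choice, because it combines values within each partition before shuffling.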
Upvotes: 0