Reputation: 47
I wrote a Spark program that takes in some input, does various things to the data, and at the end of my processing I have a
val processedData = ...
processedData is of type RDD[(Key, List[Data])] where Key and Data are case classes I have defined.
I then called
processedData.saveAsTextFile(location)
In that location is a folder with a success file and 54 part files, which I expected to see.
Now, in another program I just began writing to do some statistical analysis on my output, I start with:
val groupedData = sc.textFile(location).cache()
However, my IDE (rightly) thinks that groupedData is of type RDD[String].
What is the idiomatic way of telling the compiler/IDE that groupedData is of type RDD[(Key, List[Data])]?
Upvotes: 1
Views: 1874
Reputation: 1766
Playback:
scala> sc.parallelize(List(1,2,3).zip(List("abc","def","ghi")))
res0: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:22
scala> res0.collect
res1: Array[(Int, String)] = Array((1,abc), (2,def), (3,ghi))
scala> res0.saveAsTextFile("file")
scala> sc.textFile("file")
res3: org.apache.spark.rdd.RDD[String] = file MapPartitionsRDD[3] at textFile at <console>:22
scala> res3.collect
res4: Array[String] = Array((1,abc), (2,def), (3,ghi))
The result is a plain string: the toString representation of each element. From the documentation:
def saveAsTextFile(path: String): Unit
Save this RDD as a text file, using string representations of elements.
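If the text output already exists and rewriting it is not an option, the lines can be parsed back by hand, though this is fragile. A minimal sketch (plain Scala on a local collection; in Spark you would map the same function over `sc.textFile(location)`), assuming the values never themselves contain commas or parentheses:

```scala
// Each saved line looks like "(1,abc)" -- the tuple's toString output.
val lines = Seq("(1,abc)", "(2,def)", "(3,ghi)")

// Strip the surrounding parentheses, split on the first comma,
// and rebuild the typed pair.
val parsed: Seq[(Int, String)] = lines.map { s =>
  val Array(k, v) = s.stripPrefix("(").stripSuffix(")").split(",", 2)
  (k.toInt, v)
}
```

This breaks as soon as the data contains commas or nested structures, which is why saveAsObjectFile (below) is the cleaner fix.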
How to resolve:
scala> res0.saveAsObjectFile("file1")
scala> sc.objectFile[(Int,String)]("file1")
res9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at objectFile at <console>:22
scala> res9.collect
res10: Array[(Int, String)] = Array((1,abc), (2,def), (3,ghi))
Documentation:
def saveAsObjectFile(path: String): Unit
Save this RDD as a SequenceFile of serialized objects.
Note that you must specify the type parameter when reading the file back: it is required for deserialization, since Spark needs to know what type to retrieve.
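Applied to the question's types, the round trip would look roughly like this (a sketch, not tested against your code; Key and Data stand in for your case classes, sc is an existing SparkContext, and location is your output path):

```scala
// First program: write with saveAsObjectFile instead of saveAsTextFile.
// processedData: RDD[(Key, List[Data])]
processedData.saveAsObjectFile(location)

// Second program: read it back, giving objectFile the explicit type parameter.
val groupedData = sc.objectFile[(Key, List[Data])](location).cache()
// groupedData: RDD[(Key, List[Data])]
```

The case classes must be serializable (case classes extend Serializable by default), and both programs need the same class definitions on the classpath.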
Upvotes: 4