James

Reputation: 47

Reading a text file saved by one Spark program into another one

I wrote a Spark program that takes in some input, does various things to the data, and at the end of my processing I have a

val processedData = ...

processedData is of type RDD[(Key, List[Data])] where Key and Data are case classes I have defined.

I then called

processedData.saveAsTextFile(location)

In that location is a folder with a _SUCCESS file and 54 part files, which I expected to see.

Now, in another program I just began writing to do some statistical analysis on my output, I start with:

val groupedData = sc.textFile(location).cache()

However, my IDE (rightfully) thinks that groupedData is of type RDD[String].

What is the idiomatic way of telling the compiler/IDE that groupedData is of type RDD[(Key, List[Data])]?

Upvotes: 1

Views: 1874

Answers (1)

Zernike

Reputation: 1766

A replay of the situation in spark-shell:

scala> sc.parallelize(List(1,2,3).zip(List("abc","def","ghi")))
res0: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> res0.collect
res1: Array[(Int, String)] = Array((1,abc), (2,def), (3,ghi))

scala> res0.saveAsTextFile("file")

scala> sc.textFile("file")
res3: org.apache.spark.rdd.RDD[String] = file MapPartitionsRDD[3] at textFile at <console>:22

scala> res3.collect
res4: Array[String] = Array((1,abc), (2,def), (3,ghi))

The result is a plain string: the toString representation of each element. From the documentation:

def saveAsTextFile(path: String): Unit

Save this RDD as a text file, using string representations of elements.
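
If you wanted to stay with plain text output, you would have to parse each line back yourself. A minimal sketch, assuming the default Tuple2 toString format shown above (fragile: it breaks if a value contains a comma or parentheses):

// Hypothetical fallback: parse the "(1,abc)" strings by hand.
val parsed: org.apache.spark.rdd.RDD[(Int, String)] =
  sc.textFile("file").map { line =>
    val body = line.stripPrefix("(").stripSuffix(")")
    val Array(k, v) = body.split(",", 2) // split on the first comma only
    (k.toInt, v)
  }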

How to resolve it:

scala> res0.saveAsObjectFile("file1")

scala> sc.objectFile[(Int,String)]("file1")
res9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at objectFile at <console>:22

scala> res9.collect
res10: Array[(Int, String)] = Array((1,abc), (2,def), (3,ghi))

Documentation:

def saveAsObjectFile(path: String): Unit

Save this RDD as a SequenceFile of serialized objects.

Note that you must specify the type parameter when reading from the file; it is needed for deserialization, since Spark has to know what type to retrieve.
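
Applied to your case, a minimal sketch (assuming Key and Data are serializable, which case classes are by default):

// In the first program: save the typed RDD as serialized objects.
processedData.saveAsObjectFile(location)

// In the second program: read it back, supplying the type parameter.
val groupedData = sc.objectFile[(Key, List[Data])](location).cache()

With that, the compiler and the IDE both see groupedData as RDD[(Key, List[Data])] with no casting needed.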

Upvotes: 4
