Reputation: 3076
Spark provides the method saveAsTextFile,
which can easily store an RDD[T]
to disk or HDFS,
where T is an arbitrary serializable class.
I want to reverse the operation.
Is there a loadFromTextFile
that can easily load a file back into an RDD[T]?
Let me make it clear:
class A extends Serializable {
  ...
}

val path: String = "hdfs..."
val d1: RDD[A] = create_A
d1.saveAsTextFile(path)
val d2: RDD[A] = a_load_function(path) // this is the function I want
// d2 should be the same as d1
Upvotes: 5
Views: 9784
Reputation:
To create a file-based RDD, we can use the SparkContext.textFile API.
Below is an example:
val textFile = sc.textFile("input.txt")
We can specify the URI explicitly.
If the file is in HDFS:
sc.textFile("hdfs://host:port/filepath")
If the file is on the local filesystem:
sc.textFile("file:///path to the file/")
If the file is in S3:
sc.textFile("s3n://mybucket/sample.txt")
To load the RDD as a specific type:
case class Person(name: String, age: Int)
val people = sc.textFile("employees.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
Here, people will be of type org.apache.spark.rdd.RDD[Person]
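Putting the save and load steps together, a round trip for the Person type above might look like this. This is a minimal sketch, assuming an existing SparkContext `sc` and a writable placeholder path `"people_out"`; note that calling saveAsTextFile directly on an RDD[Person] would write each element's toString (e.g. `Person(Ann,30)`), so mapping to a plain CSV line first keeps the parsing on reload simple:

```scala
case class Person(name: String, age: Int)

// Save: render each Person as one "name,age" line of text.
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
people.map(p => s"${p.name},${p.age}").saveAsTextFile("people_out")

// Load: parse each line back into a Person.
val reloaded = sc.textFile("people_out")
  .map(_.split(","))
  .map(a => Person(a(0), a(1).trim.toInt))
```

The parsing step is the "a_load_function" from the question: textFile always yields an RDD[String], and the map back to Person must be written by hand to match the text format that was saved.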
Upvotes: 0
Reputation: 6693
Try using d1.saveAsObjectFile(path)
to store and val d2 = sc.objectFile[A](path)
to load.
I don't think you can saveAsTextFile
and read it back as RDD[A]
without a transformation from RDD[String].
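The object-file round trip above can be sketched as follows. This is a minimal sketch, assuming an existing SparkContext `sc` and a writable `path`; the class `A` is a stand-in for the serializable class from the question:

```scala
import org.apache.spark.rdd.RDD

class A(val x: Int) extends Serializable

val d1: RDD[A] = sc.parallelize(Seq(new A(1), new A(2)))

// saveAsObjectFile serializes each element (Java serialization)
// into a Hadoop SequenceFile, so no hand-written text parser is needed.
d1.saveAsObjectFile(path)

// objectFile deserializes the elements back; the explicit type
// parameter is what restores the RDD[A] the question asks for.
val d2: RDD[A] = sc.objectFile[A](path)
```

This is why objectFile answers the question where textFile cannot: the element type is preserved in the serialized bytes rather than flattened to lines of text, at the cost of a binary format that is not human-readable.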
Upvotes: 10