worldterminator

Reputation: 3076

How to load data from saved file with Spark

Spark provides the method saveAsTextFile, which can easily store an RDD[T] to disk or HDFS.

T is an arbitrary serializable class.

I want to reverse the operation. I wonder whether there is a loadFromTextFile that can easily load a file back into an RDD[T]?

Let me make it clear:

class A extends Serializable {
  ...
}

val path: String = "hdfs..."
val d1: RDD[A] = create_A

d1.saveAsTextFile(path)

val d2: RDD[A] = a_load_function(path) // this is the function I want

// d2 should be the same as d1

Upvotes: 5

Views: 9784

Answers (2)

user1261215

To create a file-based RDD, we can use the SparkContext.textFile API.

Below is an example:

val textFile = sc.textFile("input.txt")

We can also specify the URI explicitly.

If the file is in HDFS:

sc.textFile("hdfs://host:port/filepath")

If the file is on the local filesystem:

sc.textFile("file:///path to the file/")

If the file is in S3:

sc.textFile("s3n://mybucket/sample.txt")

To load an RDD of a specific type:

case class Person(name: String, age: Int)

val people = sc.textFile("employees.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

Here, people will be of type org.apache.spark.rdd.RDD[Person].
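To go the other way, i.e. write the RDD back out with saveAsTextFile and rebuild the objects on reload, you have to format and parse the records yourself. A minimal sketch, reusing the Person class above (the output directory "people_out" is hypothetical):

import org.apache.spark.rdd.RDD

// Write each Person out as one comma-separated line per record
// ("people_out" is a made-up output directory).
people.map(p => s"${p.name},${p.age}").saveAsTextFile("people_out")

// textFile always yields RDD[String], so reloading means parsing
// every line back into a Person.
val reloaded: RDD[Person] = sc.textFile("people_out")
  .map(_.split(","))
  .map(a => Person(a(0), a(1).trim.toInt))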

Upvotes: 0

yjshen

Reputation: 6693

Try d1.saveAsObjectFile(path) to store the RDD and val d2 = sc.objectFile[A](path) to load it back.
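A minimal sketch of that round trip, assuming sc is an active SparkContext and A is the serializable class from the question:

// Store the RDD as a directory of SequenceFiles containing
// Java-serialized objects.
d1.saveAsObjectFile(path)

// Reload: the type parameter tells Spark what to deserialize into.
val d2: RDD[A] = sc.objectFile[A](path)

// Sanity check: the same number of records round-tripped.
assert(d2.count() == d1.count())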

I don't think you can use saveAsTextFile and then read the result back as an RDD[A] without a transformation from RDD[String].

Upvotes: 10
