Reputation: 1821
I have a folder which has several part files from an earlier job. The folder name is "tera-output", and it contains the files part-00000, part-00001, part-00002, and so on up to part-00049. I am trying to write a Scala program to read each file in the folder. The code is given below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("s3n://test/tera-output")
    val splits = file.map(word => word)
    println(splits.count())
  }
}
My problem is that I don't know how to run a loop to read each file in the "tera-output" folder. Each file should be read through map(), with its data appended to "splits". I looked through some of the documentation and posts here and could not find a function to do that.
Could someone please help with this? Thank you in advance!!
Upvotes: 1
Views: 4599
Reputation:
You can use the sc.wholeTextFiles("mydir") API.
This returns a pair RDD where the key is the file name and the value is the file content.
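For example, a minimal sketch reusing the s3n://test/tera-output path from the question (the object name and the line-counting logic are illustrative only):

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WholeFilesExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WholeFilesExample")
    val sc = new SparkContext(conf)

    // Each element is (filePath, fileContent), one pair per file in the directory.
    val files = sc.wholeTextFiles("s3n://test/tera-output")

    // For example, count the lines across all part files.
    val totalLines = files
      .map { case (_, content) => content.split("\n").length }
      .reduce(_ + _)

    println(totalLines)
    sc.stop()
  }
}

Note that wholeTextFiles loads each file's entire content as a single record, so it works best when the individual files comfortably fit in memory.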
Upvotes: 5