user3033194

Reputation: 1821

Read multiple files in a folder with Scala for a Spark job

I have a folder with several part files from an earlier job. The folder is named "tera-output", and it contains the files part-00000, part-00001, part-00002, and so on up to part-00049. I am trying to write a Scala program to read each file in the folder. The code is given below:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("s3n://test/tera-output")
    val splits = file.map(word => word)
    println(splits.count())
  }
}

My problem is that I don't know how to run a loop to read each file in the "tera-output" folder. Each file would be read through map(), with the data appended to "splits". I looked through the documentation and posts here and could not find a function to do that.

Could someone please help with this? Thank you in advance!!

Upvotes: 1

Views: 4599

Answers (1)

user1261215

Reputation:

You can use the sc.wholeTextFiles("mydir") API.

This will return a pair RDD where the key is the file name and the value is the file content.
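As a rough sketch (reusing the s3n path from the question; the object name and the line-count logic are only for illustration), it could look like this:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object ReadFolder {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadFolder")
    val sc = new SparkContext(conf)

    // wholeTextFiles returns an RDD[(String, String)]:
    // key = file path, value = the full content of that file
    val files = sc.wholeTextFiles("s3n://test/tera-output")

    // Example: count the total number of lines across all part files
    val totalLines = files
      .map { case (path, content) => content.split("\n").length }
      .reduce(_ + _)
    println(totalLines)

    sc.stop()
  }
}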

Upvotes: 5
