Reputation: 69
Note: there could be 100 date folders, and I need to pick only specific ones (say, the 25th, 26th and 28th).
Is there a better way than the code below?
import org.apache.spark.sql._
val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")
spark.read.format("parquet").load(parquetFiles: _*)
The above code works, but I want to build the path list incrementally, like below (which doesn't compile as written, since List() creates an empty, immutable list):
val parquetFiles = List()
parquetFiles(0) = "id=200393/date=2019-03-25"
parquetFiles(1) = "id=200393/date=2019-03-26"
parquetFiles(2) = "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)
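For reference, a compiling version of that incremental pattern, as a sketch (reusing the spark session and paths from above; a mutable ArrayBuffer stands in for the immutable List, which cannot be updated in place):
import scala.collection.mutable.ArrayBuffer
// Collect the paths one by one, then splat them into the reader
val parquetFiles = ArrayBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"
val df = spark.read.parquet(parquetFiles.toSeq: _*)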
Upvotes: 1
Views: 9384
Reputation: 916
You can read all the date folders under the directory id=200393 like this:
val df = spark.read.parquet("id=200393/*")
If you want to select only some dates, for example only September 2019 (note the date= prefix, matching the folder names):
val df = spark.read.parquet("id=200393/date=2019-09-*")
If you need specific days, put them in a list and build the paths from it:
val days = List("2019-09-02", "2019-09-03")
val paths = days.map(day => "id=200393/date=" ++ day)
val df = spark.read.parquet(paths: _*)
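If the days follow a pattern instead of being hand-picked, the list itself can be generated; a sketch using java.time (the chosen days are just an example):
import java.time.LocalDate
// LocalDate.toString yields ISO format, e.g. "2019-09-02"
val days = List(2, 3).map(d => LocalDate.of(2019, 9, d).toString)
val paths = days.map(day => "id=200393/date=" ++ day)
val df = spark.read.parquet(paths: _*)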
Upvotes: 9
Reputation: 1534
If you want to keep the partition columns id and date in the DataFrame, pass a basePath pointing at the directory that contains the id=... folders (here /path/to/table is a placeholder for that root):
val df = spark.read
  .option("basePath", "/path/to/table")
  .parquet("/path/to/table/id=200393/date=*")
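Building on that, a sketch of an alternative: load everything under id=200393 and filter on the recovered date column, so Spark's partition pruning reads only the matching folders (again, /path/to/table is a placeholder root):
import org.apache.spark.sql.functions.col
// Partition pruning turns the filter into a directory selection
val df = spark.read
  .option("basePath", "/path/to/table")
  .parquet("/path/to/table/id=200393")
  .filter(col("date").isin("2019-03-25", "2019-03-26", "2019-03-28"))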
Upvotes: 0