Mradula Ghatiya

Reputation: 69

How can I read multiple Parquet files in Spark Scala?

Below are some folders, which might keep updating with time. They contain multiple .parquet files. How can I read them into a Spark DataFrame in Scala?

Note: there could be 100 date folders, and I need to pick only specific ones (say, the 25th, 26th, and 28th).

Is there any better way than the following?

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ScalaCodeTest").master("yarn").getOrCreate()
val parquetFiles = List("id=200393/date=2019-03-25", "id=200393/date=2019-03-26", "id=200393/date=2019-03-28")

spark.read.format("parquet").load(parquetFiles: _*)

The above code is working, but I want to do something like below (note that this does not compile as written, since Scala's default List is immutable and has no index assignment):

val parquetFiles = List()
parquetFiles(0) = "id=200393/date=2019-03-25"
parquetFiles(1) = "id=200393/date=2019-03-26"
parquetFiles(2) = "id=200393/date=2019-03-28"
spark.read.format("parquet").load(parquetFiles: _*)
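A minimal sketch of the incremental style asked for above, using scala.collection.mutable.ListBuffer (the paths are the ones from the question; the commented-out load call assumes the same SparkSession as before):

```scala
import scala.collection.mutable.ListBuffer

// Build the path list entry by entry, then freeze it into an immutable List.
val parquetFiles = ListBuffer[String]()
parquetFiles += "id=200393/date=2019-03-25"
parquetFiles += "id=200393/date=2019-03-26"
parquetFiles += "id=200393/date=2019-03-28"

// Then load as in the question:
// spark.read.format("parquet").load(parquetFiles.toList: _*)
```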

Upvotes: 1

Views: 9384

Answers (2)

firsni

Reputation: 916

You can read all the folders in the directory id=200393 this way:

val df  = spark.read.parquet("id=200393/*")

If you want to select only some dates, for example only September 2019:

val df = spark.read.parquet("id=200393/date=2019-09-*")

If you only need some specific days, you can put them in a list:

val days = List("2019-09-02", "2019-09-03")
val paths = days.map(day => "id=200393/date=" + day)
val df = spark.read.parquet(paths: _*)
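If the wanted days form a contiguous range, the day strings can also be generated with java.time instead of being hard-coded (a sketch, assuming the partition folders are named date=&lt;day&gt; as in the question, and that the Spark call stays as above):

```scala
import java.time.LocalDate

// Generate ISO-formatted day strings for a 2-day range starting 2019-09-02,
// then build the partition paths from them.
val start = LocalDate.parse("2019-09-02")
val days = (0 until 2).map(i => start.plusDays(i).toString).toList
val paths = days.map(day => "id=200393/date=" + day)

// val df = spark.read.parquet(paths: _*)
```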

Upvotes: 9

seiya

Reputation: 1534

If you want to keep the column 'id', you could try this:

val df = spark
  .read
  .option("basePath", "id=200393/")
  .parquet("id=200393/date=*")

Upvotes: 0
