Reputation: 1030
To reduce processing time I partitioned my data by date, so that I only use the required date's data (not the complete table). In HDFS my tables are now stored as below:
src_tbl (main dir)          trg_tbl
  2016-01-01 (sub dirs)       2015-12-30
  2016-01-02                  2015-12-31
  2016-01-03                  2016-01-01
                              2016-01-03
Now I want to select min(date) from src_tbl, which will be 2016-01-01, and from trg_tbl I want to use the data in the directories >= 2016-01-01 (the src_tbl min(date)), which will be the 2016-01-01 and 2016-01-03 data.

How can I select the required partitions (date folders) from HDFS using Spark/Scala? After completing the process I need to overwrite the same date directories too.
Details about the process: I want to choose the correct window of data (as the data for all other dates is not required) from the source and target tables, then do join -> lead/lag -> union -> write.
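Since the partition sub-directories are named by ISO date (yyyy-MM-dd), lexicographic order equals chronological order, so picking the window can be sketched with plain string comparison. This is only a sketch; `selectWindow` and the literal partition lists are hypothetical helpers, not from the post:

```scala
// Sketch, assuming partition directory names are ISO dates (yyyy-MM-dd),
// so comparing them as strings is the same as comparing dates.
def selectWindow(srcParts: Seq[String], trgParts: Seq[String]): (String, Seq[String]) = {
  val minDate = srcParts.min                 // min(date) of the source table
  (minDate, trgParts.filter(_ >= minDate))   // target partitions >= that date
}

val (minDate, window) = selectWindow(
  Seq("2016-01-01", "2016-01-02", "2016-01-03"),
  Seq("2015-12-30", "2015-12-31", "2016-01-01", "2016-01-03"))
```

The actual directory names could be obtained by listing the table path with Hadoop's `FileSystem.listStatus`.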
Upvotes: 1
Views: 1253
Reputation: 63022
Spark SQL (including the DataFrame/Dataset APIs) is kind of funny in the way it handles partitioned tables with respect to retaining the existing partitioning info from one transformation/stage to the next.
For the initial load, Spark SQL tends to do a good job of understanding how to retain the underlying partitioning information - if that information is available in the form of Hive metastore metadata for the table.
So... are these Hive tables? If so - so far so good - you should see the data loaded partition by partition, according to the Hive partitions.
Will the DataFrame/Dataset remember this nice partitioning already set up? Now things get a bit more tricky: the answer depends on whether a shuffle is required. In your case - a simple filter operation - there should be no need, so once again you should see the original partitioning preserved, and thus good performance. Please verify that the partitioning was indeed retained.
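A minimal way to see this in action locally (a sketch: the `dt` partition column, the temp path, and the toy rows are assumptions standing in for the date sub-directories in the question):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: filtering on the partition column lets Spark prune partitions,
// reading only the dt >= 2016-01-01 directories. Names are illustrative.
val spark = SparkSession.builder().master("local[1]").appName("prune").getOrCreate()
import spark.implicits._

val path = java.nio.file.Files.createTempDirectory("tbl").toString + "/trg_tbl"
Seq(("a", "2015-12-30"), ("b", "2016-01-01"), ("c", "2016-01-03"))
  .toDF("value", "dt")
  .write.partitionBy("dt").parquet(path)   // one sub-directory per date

val window = spark.read.parquet(path).filter($"dt" >= "2016-01-01")
window.explain() // look for PartitionFilters in the plan to confirm pruning
```

In the physical plan printed by `explain()`, the `PartitionFilters` entry confirms that only the 2016-01-01 and 2016-01-03 directories are scanned.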
I will mention that if any aggregate functions are invoked then you can be assured your partitioning will be lost: Spark SQL will in that case use a HashPartitioner, inducing a full shuffle.
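For intuition, a HashPartitioner places each row by the hash of its key, not by date range, which is why the per-date layout disappears. A simplified sketch of the assignment rule (Spark's real `HashPartitioner` additionally maps null keys to partition 0):

```scala
// Simplified version of HashPartitioner's assignment rule: a non-negative
// modulo of the key's hashCode over the partition count.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
```

Two adjacent dates can land in entirely different partitions under this rule, so the original date-wise locality is gone after the shuffle.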
Update: The OP provided more details here: there are lead/lag and join operations involved. Then he is well advised - strictly from a performance perspective - to avoid Spark SQL and do the operations manually.
To the OP: the only thing I can suggest at this point is to check whether preservesPartitioning=true is set in your RDD operations. But I am not even sure that capability is exposed by Spark for lag/lead: please check.
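If the lead/lag is hand-written inside mapPartitions, the flag can be passed there. A sketch with toy data (the data and names are assumptions; the per-partition lead/lag logic itself is not shown):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Sketch: a per-partition transformation (where a manual lead/lag could
// live) that keeps the existing partitioner instead of discarding it.
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("preserve"))

val rdd = sc.parallelize(Seq(("2016-01-01", 1), ("2016-01-03", 2)))
            .partitionBy(new HashPartitioner(4))

// preservesPartitioning = true promises Spark the keys are unchanged,
// so the partitioner survives and no re-shuffle is induced downstream
val mapped = rdd.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)
```

With the default `preservesPartitioning = false`, `mapped.partitioner` would be `None` and a later keyed operation could trigger a fresh shuffle.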
Upvotes: 2