user2895589

Reputation: 1030

Spark-Scala: how to work with HDFS directory partitions

To reduce processing time I partitioned my data by date so that I only use the data for the required dates (not the complete table). So now in HDFS my tables are stored like below

src_tbl //main dir                             trg_tbl
   2016-01-01 //sub dir                        2015-12-30
   2016-01-02                                  2015-12-31  
   2016-01-03                                  2016-01-01
                                               2016-01-03

Now I want to select min(date) from src_tbl, which will be 2016-01-01, and from trg_tbl I want to use the data in the directories >= 2016-01-01 (the src_tbl min(date)), which will be the 2016-01-01 and 2016-01-03 data.

How can I select the required partitions (date folders) from HDFS using Spark-Scala? After completing the process I need to overwrite the same date directories too.

Details about the process: I want to choose the correct window of data (as the data for all other dates is not required) from the source and target tables, then do join -> lead/lag -> union -> write.

Upvotes: 1

Views: 1253

Answers (1)

WestCoastProjects

Reputation: 63022

Spark SQL (including the DataFrame/Dataset APIs) is kind of funny in the way it handles partitioned tables with respect to retaining the existing partitioning info from one transformation/stage to the next.

For the initial load Spark SQL tends to do a good job of retaining the underlying partitioning information - if that information is available in the form of Hive metastore metadata for the table.

So... are these Hive tables?

If so - so far so good - you should see the data loaded partition by partition according to the Hive partitions.
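If they are, a minimal sketch of the initial load might look like the following (assuming Spark 2.x with Hive support enabled; the table names are taken from the question):

    import org.apache.spark.sql.SparkSession

    // Going through the metastore lets Spark SQL see the partition layout,
    // so later filters on the partition column can prune date directories
    // instead of scanning every folder.
    val spark = SparkSession.builder()
      .appName("partitioned-window-job")
      .enableHiveSupport()
      .getOrCreate()

    val srcTbl = spark.table("src_tbl")
    val trgTbl = spark.table("trg_tbl")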

Will the DataFrame/Dataset remember this nice partitioning already set up?

Now things get a bit more tricky. The answer depends on whether a shuffle is required or not.

In your case - a simple filter operation - there should not be any need. So once again you should see the original partitioning preserved and thus good performance. Please verify that the partitioning was indeed retained.
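As a hedged sketch of that filter step (continuing from the load above, and assuming the date sub-directories are exposed as a string partition column named dt):

    import org.apache.spark.sql.functions.{col, min}

    // min(dt) over src_tbl is a single scalar, so collecting it to the driver is cheap.
    val minDate: String = srcTbl.agg(min(col("dt"))).head().getString(0) // "2016-01-01"

    // Filtering on the partition column should prune to the needed date directories
    // (2016-01-01 and 2016-01-03 for trg_tbl) rather than reading everything.
    val srcWindow = srcTbl.filter(col("dt") >= minDate)
    val trgWindow = trgTbl.filter(col("dt") >= minDate)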

I will mention that if any aggregate functions are invoked then you can be assured your partitioning will be lost. Spark SQL will in that case use a HashPartitioner, inducing a full shuffle.

Update: The OP provided more details here: lead/lag and a join are involved. Then he is well advised - from a strictly performance perspective - to avoid Spark SQL and do the operations manually.
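For reference, this is roughly what the lead/lag step looks like through the DataFrame window API (the column names id, dt and value are assumptions); note that Window.partitionBy is exactly what introduces the exchange/shuffle discussed above:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, lead}

    // The window spec forces a hash-partitioned exchange on `id` plus a sort on `dt`.
    val w = Window.partitionBy(col("id")).orderBy(col("dt"))

    val withLagLead = srcWindow
      .withColumn("prev_value", lag(col("value"), 1).over(w))
      .withColumn("next_value", lead(col("value"), 1).over(w))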

To the OP: the only thing I can suggest at this point is to check that

preservesPartitioning=true

is set in your RDD operations. But I am not even sure that capability is exposed by Spark for lag/lead: please check.
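For what it is worth, preservesPartitioning is a flag on RDD operations such as mapPartitions, not something exposed for the DataFrame lag/lead functions. A hedged sketch of where it would appear (the key extraction and the partitioner are illustrative assumptions):

    import org.apache.spark.HashPartitioner

    // Only meaningful on a key-value RDD that already has a partitioner.
    val keyed = srcWindow.rdd
      .map(row => (row.getString(0), row))   // assumed: the join/window key lives in column 0
      .partitionBy(new HashPartitioner(8))   // illustrative partitioner

    val transformed = keyed.mapPartitions(
      iter => iter.map { case (k, v) => (k, v) }, // per-partition (e.g. manual lag/lead) work goes here
      preservesPartitioning = true                // promise Spark the keys/partitioner are unchanged
    )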

Upvotes: 2
