Reputation: 1030
To reduce processing time I partitioned my data by date, so that I only use the required date's data (not the complete table). In HDFS my tables are now stored as below:
src_tbl (main dir)          trg_tbl
  2016-01-01 (sub dirs)       2015-12-30
  2016-01-02                  2015-12-31
  2016-01-03                  2016-01-01
                              2016-01-03
Now I want to select min(date) from src_tbl, which will be 2016-01-01, and from trg_tbl I want to use the data in the directories >= 2016-01-01 (the src_tbl min(date)), which will be the 2016-01-01 and 2016-01-03 data.

How can I select the required partitions (date folders) from HDFS using Spark/Scala? After completing the process I need to overwrite the same date directories too.
Details about the process: I want to choose the correct window of data (as the data for all other dates is not required) from the source and target tables, then do join -> lead/lag -> union -> write.
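Since the partition sub-directories are named by ISO date (yyyy-MM-dd), lexicographic order equals chronological order, so picking the window can be sketched with plain string comparison. This is only a sketch; `selectWindow` and the literal partition lists are hypothetical helpers, not from the post:

```scala
// Sketch, assuming partition directory names are ISO dates (yyyy-MM-dd),
// so comparing them as strings is the same as comparing dates.
def selectWindow(srcParts: Seq[String], trgParts: Seq[String]): (String, Seq[String]) = {
  val minDate = srcParts.min                 // min(date) of the source table
  (minDate, trgParts.filter(_ >= minDate))   // target partitions >= that date
}

val (minDate, window) = selectWindow(
  Seq("2016-01-01", "2016-01-02", "2016-01-03"),
  Seq("2015-12-30", "2015-12-31", "2016-01-01", "2016-01-03"))
```

The actual directory names could be obtained by listing the table path with Hadoop's `FileSystem.listStatus`.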
Upvotes: 1
Views: 1253
Reputation: 63022
Spark SQL (including the DataFrame/Dataset APIs) is kind of funny in the way it handles partitioned tables with respect to retaining the existing partitioning info from one transformation/stage to the next.
For the initial load, Spark SQL tends to do a good job of understanding how to retain the underlying partitioning information - if that information is available in the form of Hive metastore metadata for the table.
So... are these Hive tables? If so - so far so good - you should see the data loaded partition by partition, according to the Hive partitions.
Will the DataFrame/Dataset remember this nice partitioning already set up? Now things get a bit more tricky: the answer depends on whether a shuffle is required. In your case - a simple filter operation - there should be no need, so once again you should see the original partitioning preserved, and thus good performance. Please verify that the partitioning was indeed retained.
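A minimal way to see this in action locally (a sketch: the `dt` partition column, the temp path, and the toy rows are assumptions standing in for the date sub-directories in the question):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: filtering on the partition column lets Spark prune partitions,
// reading only the dt >= 2016-01-01 directories. Names are illustrative.
val spark = SparkSession.builder().master("local[1]").appName("prune").getOrCreate()
import spark.implicits._

val path = java.nio.file.Files.createTempDirectory("tbl").toString + "/trg_tbl"
Seq(("a", "2015-12-30"), ("b", "2016-01-01"), ("c", "2016-01-03"))
  .toDF("value", "dt")
  .write.partitionBy("dt").parquet(path)   // one sub-directory per date

val window = spark.read.parquet(path).filter($"dt" >= "2016-01-01")
window.explain() // look for PartitionFilters in the plan to confirm pruning
```

In the physical plan printed by `explain()`, the `PartitionFilters` entry confirms that only the 2016-01-01 and 2016-01-03 directories are scanned.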
I will mention that if any aggregate functions are invoked then you can be assured your partitioning will be lost: Spark SQL will in that case use a HashPartitioner, inducing a full shuffle.
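For intuition, a HashPartitioner places each row by the hash of its key, not by date range, which is why the per-date layout disappears. A simplified sketch of the assignment rule (Spark's real `HashPartitioner` additionally maps null keys to partition 0):

```scala
// Simplified version of HashPartitioner's assignment rule: a non-negative
// modulo of the key's hashCode over the partition count.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}
```

Two adjacent dates can land in entirely different partitions under this rule, so the original date-wise locality is gone after the shuffle.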
Update: The OP provided more details here: there are lead/lag and join operations involved. Then he is well advised - strictly from a performance perspective - to avoid Spark SQL and do the operations manually.
To the OP: the only thing I can suggest at this point is to check whether preservesPartitioning=true is set in your RDD operations. But I am not even sure that capability is exposed by Spark for lag/lead: please check.
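If the lead/lag is hand-written inside mapPartitions, the flag can be passed there. A sketch with toy data (the data and names are assumptions; the per-partition lead/lag logic itself is not shown):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Sketch: a per-partition transformation (where a manual lead/lag could
// live) that keeps the existing partitioner instead of discarding it.
val sc = new SparkContext(new SparkConf().setMaster("local[1]").setAppName("preserve"))

val rdd = sc.parallelize(Seq(("2016-01-01", 1), ("2016-01-03", 2)))
            .partitionBy(new HashPartitioner(4))

// preservesPartitioning = true promises Spark the keys are unchanged,
// so the partitioner survives and no re-shuffle is induced downstream
val mapped = rdd.mapPartitions(
  iter => iter.map { case (k, v) => (k, v + 1) },
  preservesPartitioning = true)
```

With the default `preservesPartitioning = false`, `mapped.partitioner` would be `None` and a later keyed operation could trigger a fresh shuffle.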
Upvotes: 2