Reputation: 16076
Recently I was working with Spark and a JDBC data source. Consider the following snippet:
val df = spark.read.options(options).format("jdbc").load()
val newDF = df.where(PRED)
PRED is a list of predicates.
If PRED is a simple predicate, like x = 10, the query will be much faster. However, with non-equi conditions like date > someOtherDate or date < someOtherDate2, the query is much slower than without predicate pushdown. As you may know, database engine scans driven by such predicates are very slow; in my case the query was even 10 times slower (!).
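As an aside, the physical plan shows what actually gets pushed: for a JDBC source, the scan node lists the pushed predicates under PushedFilters. A quick check (the column name and date literal here are just placeholders):

import org.apache.spark.sql.functions.col

// In the explain() output, the JDBC scan node lists pushed predicates, e.g.
//   PushedFilters: [GreaterThan(date,2018-01-01)]
df.where(col("date") > "2018-01-01").explain()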
To prevent unnecessary predicate pushdown I used:
val cachedDF = df.cache()
val newDF = cachedDF.where(PRED)
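This does block the pushdown: once the data is cached, the predicate is evaluated by Spark over the in-memory data instead of being sent to the database. A quick way to confirm (same placeholder column as above):

import org.apache.spark.sql.functions.col

// The predicate now appears in the Spark plan over the in-memory relation,
// with no corresponding PushedFilters entry in a JDBC scan.
cachedDF.where(col("date") > "2018-01-01").explain()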
But it requires a lot of memory, and, due to the problem mentioned here - Spark' Dataset unpersist behaviour - I can't unpersist cachedDF.
Is there any other option to avoid pushing down predicates, without caching and without writing my own data source?
Note: even if there is an option to turn off predicate pushdown, it is only applicable if other queries may still use it. So, if I wrote:
// some fancy option set to not push down predicates
val df1 = ...
// predicate pushdown works again
val df2 = ...
df1.join(df2) // df1 without predicate pushdown, but df2 with it
Upvotes: 7
Views: 3000
Reputation: 40360
A JIRA ticket has been opened for this issue. You can follow it here: https://issues.apache.org/jira/browse/SPARK-24288
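Note: if I remember correctly, this ticket was resolved in Spark 2.4 by adding a pushDownPredicate option (default true) to the JDBC source, which can be disabled per read. A sketch, assuming Spark 2.4+; the url and dbtable values are placeholders:

// Sketch, assuming Spark 2.4+ where SPARK-24288 added this option.
val df1 = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")   // placeholder connection string
  .option("dbtable", "my_table")                // placeholder table
  .option("pushDownPredicate", "false")         // predicates evaluated by Spark, not the DB
  .load()

// A second read without the option keeps pushdown enabled, so
// df1.join(df2) behaves as the question asks: df1 unpushed, df2 pushed.
val df2 = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "other_table")
  .load()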
Upvotes: 3