Ged

Reputation: 18003

Spark Data Source V2 API clarification for Filter Push Down

I have been reading up on the Data Source V2 API and Filter Pushdown (and, presumably, Partition Pruning). The examples talk about pushdown to, say, MySQL.

OK, I am not clear on this. I see discussion of the Data Source V2 API here and there (e.g. in Exploring Spark DataSource V2 - Part 4 : In-Memory DataSource with Partitioning). All well and good, but I can already get pushdown working for MySQL, as the answer states. The discussions somehow imply the opposite, so I am clearly missing a point somewhere along the line and I would like to know what.

My question/observation is that I can already do Filter Push Down for a JDBC source such as MySQL, e.g. as follows:

sql = "(select * from mytab where day = 2016-11-25 and hour = 10) t1"

This ensures not all data is brought back to SPARK.
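For context, this is roughly how that subquery string is wired into a JDBC read (the URL, credentials and table name below are placeholders, not from the original post):

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")  // placeholder URL
  .option("dbtable", sql)                          // MySQL evaluates the WHERE clause
  .option("user", "user")
  .option("password", "password")
  .load()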

So, what am I missing?

Upvotes: 0

Views: 1014

Answers (2)

Jacek Laskowski

Reputation: 74619

Filter Pushdown in Data Source V2 API

In the Data Source V2 API, only data sources whose DataSourceReaders implement the SupportsPushDownFilters interface support the Filter Pushdown performance optimization.

Whether a data source supports filter pushdown in the Data Source V2 API is therefore just a matter of checking out the underlying DataSourceReader.
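For illustration only, here is a minimal sketch of what such a DataSourceReader could look like (the class name and the EqualTo-only filter handling are made up, and it assumes the Spark 2.4 flavour of the V2 reader API):

import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical reader that accepts only EqualTo filters for pushdown and
// hands everything else back to Spark to evaluate after the scan.
class ExampleReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  private var pushed = Array.empty[Filter]

  override def readSchema(): StructType = schema

  // Spark calls this during planning; the returned filters are the ones
  // this source could NOT handle, so Spark re-applies them itself.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(_.isInstanceOf[EqualTo])
    pushed = supported
    unsupported
  }

  // Spark uses this to report the pushed filters (e.g. in the physical plan).
  override def pushedFilters(): Array[Filter] = pushed

  // A real scan would use `pushed` to read only matching data; omitted here.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList()
}

Whatever pushFilters returns is re-evaluated by Spark after the scan, so a source can safely decline filters it cannot translate.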

For MySQL it would be the JDBC data source, which is represented by JdbcRelationProvider and does not seem to support the Data Source V2 API (via ReadSupport). In other words, I doubt that MySQL is served by a Data Source V2 data source, and so no filter pushdown via the new Data Source V2 API is expected.

Filter Pushdown in Data Source V1 API

That does not preclude the filter pushdown optimization from being used via other, non-Data Source V2 APIs, i.e. the Data Source V1 API.

In the case of the JDBC data source, filter pushdown is indeed supported through the older PrunedFilteredScan contract (which, nota bene, is implemented by JDBCRelation only). That is, however, the Data Source V1 API.
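For comparison, a minimal sketch of the V1 contract (the relation below is hypothetical; JDBCRelation is the real implementation, which translates the filters into a WHERE clause):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical V1 relation: Spark hands it the required columns and the
// candidate filters; the relation decides which filters it can honour.
class ExampleRelation(override val sqlContext: SQLContext, override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real implementation would turn `filters` into a WHERE clause and
    // `requiredColumns` into a SELECT list; this sketch returns nothing.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}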

Upvotes: 0

user10233591

Reputation: 71

This ensures not all data is brought back to Spark.

Yes, it does, but

val df = spark.read.jdbc(url, "mytab", ...)

df.where($"day" === "2016-11-25" and $"hour" === 10)

should as well, as long as there is no casting required, no matter the version (1.4 onward).
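One way to confirm the pushdown actually happened (a suggestion beyond the original answer) is to inspect the physical plan; the JDBC scan node lists what it handed to the database under PushedFilters:

// The JDBC scan node of the physical plan reports the pushed predicates
// under "PushedFilters".
df.where($"day" === "2016-11-25" and $"hour" === 10).explain()

If a predicate does not appear there (e.g. because a cast was needed), Spark evaluates it itself after fetching the rows.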

Upvotes: 1
