Ged

Reputation: 18003

Spark Data Source V2 API clarification for Filter Push Down

I have been reading up on the Data Source V2 API and Filter Pushdown (and, presumably, Partition Pruning). The examples talk about pushdown to, say, MySQL.

OK, I am not clear on this. I see discussion of the Data Source V2 API here and there (e.g. in Exploring Spark DataSource V2 - Part 4 : In-Memory DataSource with Partitioning). All well and good, but I can already get pushdown working for MySQL, as the answer states. The discussions somehow imply the opposite, so I am clearly missing a point somewhere along the line and I would like to know what.

My question/observation is that I can already do Filter Push Down for a JDBC source such as MySQL, e.g. as follows:

sql = "(select * from mytab where day = 2016-11-25 and hour = 10) t1"

This ensures not all data is brought back to SPARK.
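For context, this is roughly how that subquery string is wired into a JDBC read (the URL, credentials and table name below are placeholders, not from the original post):

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")  // placeholder URL
  .option("dbtable", sql)                          // MySQL evaluates the WHERE clause
  .option("user", "user")
  .option("password", "password")
  .load()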

So, what am I missing?

Upvotes: 0

Views: 1014

Answers (2)

Jacek Laskowski

Reputation: 74619

Filter Pushdown in Data Source V2 API

In the Data Source V2 API, only data sources whose DataSourceReaders implement the SupportsPushDownFilters interface support the Filter Pushdown performance optimization.

Whether a data source supports filter pushdown in the Data Source V2 API is therefore just a matter of checking out the underlying DataSourceReader.
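For illustration only, here is a minimal sketch of what such a DataSourceReader could look like (the class name and the EqualTo-only filter handling are made up, and it assumes the Spark 2.4 flavour of the V2 reader API):

import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{EqualTo, Filter}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical reader that accepts only EqualTo filters for pushdown and
// hands everything else back to Spark to evaluate after the scan.
class ExampleReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  private var pushed = Array.empty[Filter]

  override def readSchema(): StructType = schema

  // Spark calls this during planning; the returned filters are the ones
  // this source could NOT handle, so Spark re-applies them itself.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(_.isInstanceOf[EqualTo])
    pushed = supported
    unsupported
  }

  // Spark uses this to report the pushed filters (e.g. in the physical plan).
  override def pushedFilters(): Array[Filter] = pushed

  // A real scan would use `pushed` to read only matching data; omitted here.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList()
}

Whatever pushFilters returns is re-evaluated by Spark after the scan, so a source can safely decline filters it cannot translate.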

For MySQL it would be the JDBC data source, which is represented by JdbcRelationProvider and does not seem to support the Data Source V2 API (via ReadSupport). In other words, I doubt that MySQL is served by a Data Source V2 data source, and so no filter pushdown via the new Data Source V2 API is expected.

Filter Pushdown in Data Source V1 API

That does not preclude the filter pushdown optimization from being used via other, non-Data Source V2 APIs, i.e. the Data Source V1 API.

In the case of the JDBC data source, filter pushdown is indeed supported through the older PrunedFilteredScan contract (which, nota bene, is implemented by JDBCRelation only). That is, however, the Data Source V1 API.
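For comparison, a minimal sketch of the V1 contract (the relation below is hypothetical; JDBCRelation is the real implementation, which translates the filters into a WHERE clause):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical V1 relation: Spark hands it the required columns and the
// candidate filters; the relation decides which filters it can honour.
class ExampleRelation(override val sqlContext: SQLContext, override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real implementation would turn `filters` into a WHERE clause and
    // `requiredColumns` into a SELECT list; this sketch returns nothing.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}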

Upvotes: 0

user10233591

Reputation: 71

This ensures not all data is brought back to Spark.

Yes, it does, but

val df = spark.read.jdbc(url, "mytab", ...)

df.where($"day" === "2016-11-25" and $"hour" === 10)

should as well, as long as there is no casting required, no matter the version (1.4 onward).
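One way to confirm the pushdown actually happened (a suggestion beyond the original answer) is to inspect the physical plan; the JDBC scan node lists what it handed to the database under PushedFilters:

// The JDBC scan node of the physical plan reports the pushed predicates
// under "PushedFilters".
df.where($"day" === "2016-11-25" and $"hour" === 10).explain()

If a predicate does not appear there (e.g. because a cast was needed), Spark evaluates it itself after fetching the rows.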

Upvotes: 1
