Apache Spark selects all rows

Question

When I use JDBC connection to feed spark, even if I use filtering on dataframe; when I inspect query log on my oracle datasource, I am seeing spark executing:

SELECT [column_names] FROM MY_TABLE

Referring to https://stackoverflow.com/a/40870714/1941560,

I was expecting spark lazily plan query and execute like;

SELECT [column_names] FROM MY_TABLE WHERE [filter_predicate]

But spark is not doing that. It takes all the data and filters afterwards. I need this behaviour because I don't want to retrieve all the table every x minutes but only changed rows (incremental filterin by UPDATE_DATE).

Is there a way to achieve this?

Here is my python code:

df = ...
lookup_seconds = 5 * 60;
now = datetime.datetime.now(pytz.timezone("some timezone"))
max_lookup_datetime = now - datetime.timedelta(seconds=lookup_seconds)
df.where(df.UPDATE_DATE > max_lookup_datetime).explain()

Explain result:

Physical Plan == *Filter (isnotnull(UPDATE_DATE#21) && (UPDATE_DATE#21 > 1516283483208806)) +- Scan ExistingRDD[NO#19,AMOUNT#20,UPDATE_DATE#21,CODE#22,AMOUNT_OLD#23]

EDIT: Complete answer is here

Guitao · Accepted Answer

From the official doc1:

dbtable The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.

You could set the JDBC option dbtable to a subquery SQL. For example:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "(select * from tbl where UPDATE_DATE > max_lookup_datetime) t") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

Apache Spark selects all rows

Answers (2)

Related Questions