Patrick Fürst

Reputation: 11

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.

    df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)

One column is a timestamp column and I want to filter it like this:

    df_new_data = df.where(df.ts > last_datetime)

This way the filter is pushed down as a SQL query, but the datetime format is not right, so I tried this approach:

    df_new_data = df.where(df.ts > F.date_format(F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS"))

but then the filter is not pushed down anymore.

Can someone clarify why this is the case?

Upvotes: 0

Views: 332

Answers (1)

Lakshman Battini

Reputation: 1912

When loading data from a database table, if you want the query pushed down to the database so that only the matching rows come back, you can provide a query instead of a table name and get just the result as a DataFrame. This way the database engine processes the query and returns only the results to Spark.

The table parameter identifies the JDBC table to read; you can use anything that is valid in the FROM clause of a SQL query. Note that an alias must be provided for the subquery.

    pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
    df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
    df.show()
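
Applied to your case, here is a minimal sketch of the same idea for the timestamp filter (assuming the table is called tablename, the column is ts, and last_datetime is a Python datetime, as in the question; the ::timestamp cast is Postgres-specific):

    from datetime import datetime

    last_datetime = datetime(2019, 1, 1)  # example cutoff; replace with your actual value

    # Build a subquery that Postgres evaluates itself, so only the new rows
    # are transferred to Spark. The alias (t) is required.
    pushdown_query = "(select * from tablename where ts > '{}'::timestamp) t".format(
        last_datetime.strftime("%Y-%m-%d %H:%M:%S.%f")
    )

    df_new_data = spark.read.jdbc(url=data_base_url, table=pushdown_query,
                                  properties=properties)
    df_new_data.show()

Since the WHERE clause is part of the query string itself, the filter is guaranteed to run in the database rather than depending on Spark pushing down the DataFrame expression.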

Upvotes: 1
