Spark SQL ignoring dynamic partition filter value

Question

I'm running into an issue with Spark 2.4 on EMR 5.20 in AWS.

My table is partitioned by a string column that holds date values, formatted like 2019-01-01 for January 1st, 2019. My goal is to use the max value of this column as a filter.

In this query, I filter the partition column on the result of a subquery (a string value), but Spark ends up reading all partition directories, not just the one matching the max(mypartitioncolumn) result.

spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show

However, if I hardcode the value, it only reads the proper directory.

spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show

Why is Spark not treating both queries the same way? I made sure that running the select max(mypartitioncolumn) from myothertable query on its own returns the exact same value as my hardcoded one (as well as the same data type).

I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the partition column in the source table and the subquery result are both string types. I also tried casting the value explicitly with cast( (select max(mypartitioncolumn) from myothertable) as string), but it doesn't make any difference.
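
For comparison, here is a minimal sketch of a two-step variant: resolve the max value first, then inline it as a literal so the query has the same shape as the hardcoded one above. The helper name pinnedQuery is hypothetical, and the idea that the subquery result is not available when partitions are selected is only an assumption, not something confirmed here. The commented lines assume a spark-shell session where spark is already defined.

```scala
// Hypothetical helper (not part of Spark): builds the same query as the
// hardcoded version, with the partition value inlined as a string literal.
def pinnedQuery(partitionValue: String): String = {
  // Only accept values shaped like 2019-01-01 before inlining them into SQL.
  require(partitionValue.matches("\\d{4}-\\d{2}-\\d{2}"),
    s"unexpected partition value: $partitionValue")
  "select mypartitioncolumn, column1, column2 from mydatabase.mytable " +
    s"where mypartitioncolumn = '${partitionValue}' group by 1,2,3"
}

// Assumed usage in spark-shell (`spark` is the predefined SparkSession):
// val maxValue = spark.sql("select max(mypartitioncolumn) from myothertable").first().getString(0)
// spark.sql(pinnedQuery(maxValue)).explain(true) // inspect the partition filters in the plan
// spark.sql(pinnedQuery(maxValue)).show
```

Running explain(true) on both the subquery version and this version should make it possible to compare how the partition filter appears in each physical plan.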
