Ilya P

Spark SQL ignoring dynamic partition filter value

Running into an issue on Spark 2.4 on EMR 5.20 in AWS.

I have a string partition column that holds date values, and my goal is to reference the max value of this column as a filter. The values look like this: 2019-01-01 for January 1st, 2019.

In the query below, I am trying to filter to a certain date value (a string data type), but Spark ends up reading all partition directories, not just the one matching the max value.

spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= (select max(mypartitioncolumn) from myothertable) group by 1,2,3 ").show

However, if I hardcode the value, Spark only reads the proper directory.

spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn= '2019-01-01' group by 1,2,3 ").show

Why does Spark not treat both methods the same way? I made sure that running the select max(mypartitioncolumn) from myothertable query on its own returns exactly the same value as my hardcoded literal (and the same data type).
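
For illustration, one rough way to check that in the spark-shell might look like this (using the placeholder names from the question):

// Pull the subquery result into the driver and compare it to the literal
val maxValue = spark.sql("select max(mypartitioncolumn) from myothertable").collect()(0).getString(0)
println(maxValue)                 // expected: 2019-01-01
println(maxValue == "2019-01-01") // expected: true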

I can't find anything in the documentation that differentiates partition querying other than data type differences. I checked that the partition column is a string type in both the source table and the lookup table, and I also tried explicitly casting the subquery result to a string, cast((select max(mypartitioncolumn) from myothertable) as string), but it makes no difference (see the sketch below).
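
Spelled out in full, that cast attempt looks roughly like this, and it still reads every partition:

spark.sql("select mypartitioncolumn, column1, column2 from mydatabase.mytable where mypartitioncolumn = cast((select max(mypartitioncolumn) from myothertable) as string) group by 1,2,3").show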

Upvotes: 2

Views: 1729

Answers (1)

Ilya P

Worked around the issue by changing this configuration:

sql("set spark.sql.hive.convertMetastoreParquet = false")

From the Spark docs:

"When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default."

Upvotes: 1
