Reputation: 10662
Spark version: 2.4.2 on Amazon EMR 5.24.0
I have a Glue Catalog table backed by S3 Parquet directory. The Parquet files have case-sensitive column names (like lastModified
). It doesn't matter what I do, I get lowercase column names (lastmodified
) when reading the Glue Catalog table with Spark:
for {
i <- Seq(false, true)
j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
k <- Seq(false, true)
} {
val spark = SparkSession.builder()
.config("spark.sql.hive.convertMetastoreParquet", i)
.config("spark.sql.hive.caseSensitiveInferenceMode", j)
.config("spark.sql.parquet.mergeSchema", k)
.enableHiveSupport()
.getOrCreate()
import spark.sql
val df = sql("""SELECT * FROM ecs_db.test_small""")
df.columns.foreach(println)
}
[1] https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
Upvotes: 2
Views: 3991
Reputation: 10662
The below solution is incorrect.
spark.sql.sources.schema.*
properties, but Spark SQL should. The default in Spark 2.4 for spark.sql.hive.caseSensitiveInferenceMode
is INFER_AND_SAVE
which means that Spark infers the schema from the underlying files and alters the tables to add the spark.sql.sources.schema.*
properties to SERDEPROPERTIES
. In our case, Spark failed to do so, because of a IllegalArgumentException: Can not create a Path from an empty string
exception which is caused because the Hive database class instance has an empty locationUri
property string. This is caused because the Glue database does not have a Location property INFER_ONLY
, which should only infer the schema from the files and not attempt to alter the table SERDEPROPERTIES
. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased (see here).This bug happens because the Glue table's SERDEPROPERTIES
is missing two important properties:
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
To solve the problem, I had to add those two properties via the Glue console (couldn't do it with ALTER TABLE …
)
I guess this is a bug with Glue crawlers, which do not set these properties when creating the table.
Upvotes: 2