I am attempting to set up a readStream using Auto Loader in PySpark on Databricks:
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("inferSchema", True) \
    .option("cloudFiles.schemaLocation", schema_path) \
    .option("cloudFiles.schemaHints", "col1 string, col2 timestamp, col3 timestamp, col4 timestamp, col5 timestamp, col6 int, col7 MAP<STRING,STRING>, col8 MAP<STRING,STRING>, col9 MAP<STRING,STRING>, col10 MAP<STRING,STRING>, col11 MAP<STRING,STRING>, col12 MAP<STRING,STRING>, col13 MAP<STRING,STRING>") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .load(raw_path_df) \
    .writeStream \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .toTable(bronze_tbl)
However, I keep getting java.lang.Exception: Unsupported type: map<string,string>
I'm not sure why this is happening. I have used Auto Loader to read in data countless times before, and have used the MAP type as a schema hint, so I'm not sure what I'm missing here.
The readStream above works as soon as I remove the schemaHints option.
This happens because CSV by definition doesn't support complex types, only primitives such as strings and numbers. Otherwise, what data representation should be used for the map type: JSON-encoded, or something custom?
If you have your data encoded as JSON, then you simply need to apply from_json to the corresponding columns after loading.
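As a minimal sketch of that approach, reusing the paths from the question and showing only col7 as a representative map column (hinting it as a plain string is my assumption about the incoming data, i.e. that each cell holds a JSON object like {"k": "v"}):

from pyspark.sql.functions import from_json
from pyspark.sql.types import MapType, StringType

# target type for the parsed map columns
map_schema = MapType(StringType(), StringType())

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", schema_path) \
    .option("cloudFiles.schemaHints", "col1 string, col6 int, col7 string") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .load(raw_path_df) \
    .withColumn("col7", from_json("col7", map_schema))  # parse JSON string into a real map

df.writeStream \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .toTable(bronze_tbl)

The same withColumn pattern would apply to col8 through col13.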