I am attempting to set up a readStream using Auto Loader in PySpark on Databricks:
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("inferSchema", True) \
    .option("cloudFiles.schemaLocation", schema_path) \
    .option("cloudFiles.schemaHints", "col1 string, col2 timestamp, col3 timestamp, col4 timestamp, col5 timestamp, col6 int, col7 MAP<STRING,STRING>, col8 MAP<STRING,STRING>, col9 MAP<STRING,STRING>, col10 MAP<STRING,STRING>, col11 MAP<STRING,STRING>, col12 MAP<STRING,STRING>, col13 MAP<STRING,STRING>") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .load(raw_path_df) \
    .writeStream \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .toTable(bronze_tbl)
However, I keep getting java.lang.Exception: Unsupported type: map<string,string>
I'm not sure why this is happening. I have used Auto Loader to read in data countless times before, and have used the MAP type as a schema hint, so I'm not sure what I'm missing here.
The readStream above works as soon as I remove the schemaHints option.
This happens because CSV by definition doesn't support complex types, only primitives such as strings and numbers. Otherwise, what data representation should be used for the map type: JSON-encoded, or something custom?
If you have your data encoded as JSON, then you simply need to apply from_json to the corresponding columns after loading.
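As a minimal sketch of that approach, reusing the paths from the question and showing only col7 as a representative map column (hinting it as a plain string is my assumption about the incoming data, i.e. that each cell holds a JSON object like {"k": "v"}):

from pyspark.sql.functions import from_json
from pyspark.sql.types import MapType, StringType

# target type for the parsed map columns
map_schema = MapType(StringType(), StringType())

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", schema_path) \
    .option("cloudFiles.schemaHints", "col1 string, col6 int, col7 string") \
    .option("cloudFiles.schemaEvolutionMode", "rescue") \
    .load(raw_path_df) \
    .withColumn("col7", from_json("col7", map_schema))  # parse JSON string into a real map

df.writeStream \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .toTable(bronze_tbl)

The same withColumn pattern would apply to col8 through col13.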