user84

Reputation: 885

Data types are converted to string when transferring data with ADF to Databricks tables

Big Picture: Netezza tables (with integer and datetime values) -----> Databricks table with all columns as string

Details:

I tried the following approach, but it is not working:

from pyspark.sql.types import (StructType, StructField, IntegerType, LongType,
                               FloatType, DoubleType, StringType, DateType,
                               TimestampType, ArrayType, MapType)

dataSchema = StructType([
                         StructField("col1", IntegerType()),
                         StructField("col2", LongType()),
                         StructField("col3", FloatType()),
                         StructField("col4", DoubleType()),
                         StructField("col5", StringType()),
                         StructField("col6", DateType()),
                         StructField("col7", TimestampType()),  # PySpark has no TimeType
                         StructField("col8", ArrayType(StringType())),  # element type is required
                         StructField("col9", MapType(StringType(), StringType())),  # key and value types are required
                        ])
df.write \
  .option("schema", dataSchema) \
  ......
  .save()

Please share your experience on how I can enforce the desired data types for these table columns.

Upvotes: 0

Views: 425

Answers (1)

Jim Todd

Reputation: 1588

You can use Parquet format instead of CSV as the sink in the ADF pipeline. Parquet retains the data types from the source, rather than turning all columns into strings the way CSV does. Parquet also helps you in a couple of ways:

  1. You can use compression such as Snappy to save some space in ADLS.
  2. It integrates easily with Spark/Databricks/Hive, as you mentioned in your question.

A small comparison to help you understand. I tried with Parquet and CSV, and you can see the difference in the ADF pipeline sink screenshots:

CSV sink: all columns as string [screenshot]

Parquet sink: columns with equivalent types [screenshot]

Upvotes: 1
