Reputation: 329
While trying to load data into Redshift from AWS S3, I am facing an issue with any column in the Redshift table of type decimal. I am able to load non-decimal numbers into Redshift, but I can't load decimal datatypes like NUMERIC(18,4).
DF schema in S3: A Integer, B string, C decimal(18,4), D timestamp
Redshift table schema: A INTEGER, B VARCHAR(20), C NUMERIC(18,4), D TIMESTAMP
Error message from the stl_load_errors table:
Invalid digit, Value '"', Pos 0, Type: Decimal
Data that Redshift is trying to load:
2|sample_string||2021-04-03|
Why is the decimal column coming through as empty or NULL? As you can see above, all the data arrives in the proper format except the decimal column, which is empty.
This is the code that I am using to load data into Redshift from S3:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App_name").getOrCreate()

# Kryo serializer settings
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.conf.set("spark.kryo.unsafe", "true")
spark.conf.set("spark.kryoserializer.buffer.max", "1024m")

# Read the source data from S3
df = spark.read.parquet(s3_input_path)

# Runs before the COPY: stage into a temp table cloned from the target
pre_query = """
begin;
create table {} (like {});
end;
""".format(temp_table, actual_table_name)

# Runs after the COPY: move the staged rows into the target table
post_query = """
begin;
--some action
insert into {} select * from {};
end;
""".format(actual_table_name, temp_table)

df.write.format('com.databricks.spark.redshift') \
    .option("url", "jdbc:redshift://aws.redshift.amazonaws.sampleurl.com:5439/") \
    .option("user", "UserName") \
    .option("preactions", pre_query) \
    .option("password", "Password") \
    .option("dbtable", "table_name") \
    .option("extracopyoptions", "ACCEPTINVCHARS AS '?' TRUNCATECOLUMNS") \
    .option("postactions", post_query) \
    .option("tempformat", "CSV GZIP") \
    .option("tempdir", "s3a://aws-bucket/") \
    .option("csvseparator", "|") \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()
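Before the write, the decimal column can be checked in the DataFrame itself to confirm the values are actually present (a minimal sketch; the column name C comes from the schema above):
from pyspark.sql import functions as F

# C should print as decimal(18,4) and contain non-null values,
# which would show the blanking happens during the Redshift load,
# not upstream in the parquet data
df.printSchema()
df.select("C").show(5, truncate=False)
print(df.filter(F.col("C").isNull()).count())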
Upvotes: 1
Views: 2714
Reputation: 329
I found the problem: I was using the spark-redshift library 2.x. In order to save the tempdir in CSV format, you need spark-redshift 3.x. You can use the latest version, 3.0.0-preview1.
You can upgrade the library in your build,
or
you can pass the package with your command, like spark-submit --packages com.databricks:spark-redshift_2.10:3.0.0-preview1....
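For reference, a complete invocation might look like the sketch below; the master and script name are placeholders, not part of the original command:
spark-submit \
    --master yarn \
    --packages com.databricks:spark-redshift_2.10:3.0.0-preview1 \
    my_redshift_load_job.py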
Explanation:
When writing to Redshift, the data is first staged in a temp folder in S3 before being loaded into Redshift. The default format for this temp data between Apache Spark and Redshift is Avro. However, Avro stores a decimal as binary, which Redshift interprets as empty strings or nulls.
To remove this blank issue (and to improve performance), the CSV tempformat is best suited. But I was using spark-redshift 2.x, which falls back to the Avro tempformat even if CSV is requested explicitly.
After adding the 3.0.0-preview1 package to the command, the job can use the CSV tempformat support that is present in spark-redshift 3.x.
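With the 3.x connector on the classpath, the tempformat option from the original write is actually honored. A minimal sketch of the relevant options, reusing the same placeholders as in the question:
# With spark-redshift 3.x the CSV tempformat takes effect, so decimals
# are staged as plain text in S3 instead of Avro binary
df.write.format('com.databricks.spark.redshift') \
    .option("url", "jdbc:redshift://aws.redshift.amazonaws.sampleurl.com:5439/") \
    .option("dbtable", "table_name") \
    .option("tempdir", "s3a://aws-bucket/") \
    .option("tempformat", "CSV GZIP") \
    .option("csvseparator", "|") \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()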
Reference:
https://kb.databricks.com/data/redshift-fails-decimal-write.html
https://github.com/databricks/spark-redshift/issues/308
Upvotes: 0