Reputation: 2083
I am trying to read a table from SQL Server and apply partitioning while reading. Before reading the data, I wanted to get the bounds for lowerBound & upperBound, as below.
boundsDF = spark.read.format('jdbc') \
    .option('url', url) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('user', username) \
    .option('password', password) \
    .option('dbtable', f'(select min(updated_datetime) as mint, max(updated_datetime) as maxt from {tablename}) as bounds') \
    .load()
I extracted the values from boundsDF like below:
maxdate = [x["maxt"] for x in boundsDF.rdd.collect()]
mindate = [x["mint"] for x in boundsDF.rdd.collect()]
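As an aside, the same two values can also be pulled with a single action instead of two separate collects; a small sketch using the mint/maxt aliases from the bounds query above:

# One round trip to the driver instead of two full collects
row = boundsDF.first()
mindate = [row['mint']]
maxdate = [row['maxt']]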
And this is how I specified the timestamp column while reading:
dataframe = spark.read.format('jdbc') \
    .option('url', url) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('user', user) \
    .option('password', password) \
    .option('dbtable', tablename) \
    .option('partitionColumn', timestamp_column) \
    .option('numPartitions', 3) \
    .option('lowerBound', mindate[0]) \
    .option('upperBound', maxdate[0]) \
    .option('fetchsize', 5000) \
    .load()
If I print the values of mindate & maxdate, this is how they look:
mindate[0]: datetime.datetime(2010, 10, 4, 11, 54, 13, 543000)
maxdate[0]: datetime.datetime(2021, 3, 5, 17, 59, 45, 880000)
When I call dataframe.count(), I see an exception message as below.
Exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 18.0 failed 1 times, most recent failure: Lost task 2.0 in stage 18.0 (TID 21, executor driver): com.microsoft.sqlserver.jdbc.SQLServerException: Conversion failed when converting date and/or time from character string.
Since I started using Spark, I have always used an integer column as my partition column. This is the first time I am using a timestamp column to partition the data.
Are mindate[0] & maxdate[0] in the right format to be specified in my read statement? Could anyone let me know if I am implementing the code in the right manner?
Upvotes: 1
Views: 1721
Reputation: 83
I ran across the same issue, but after some experimentation I was able to find a solution.
It seems there is a bug that was raised about two years ago and sadly remains open to this day: https://issues.apache.org/jira/browse/SPARK-34097
However, if you pass just the date, i.e. YYYY-MM-DD, instead of a full timestamp in lowerBound & upperBound, the code works, as in the sketch below.
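A minimal sketch of that workaround, reusing the variables from the question (the strftime formatting is my own addition, not part of the original code):

# Format the collected datetimes as plain YYYY-MM-DD strings
lower = mindate[0].strftime('%Y-%m-%d')   # '2010-10-04'
upper = maxdate[0].strftime('%Y-%m-%d')   # '2021-03-05'

dataframe = spark.read.format('jdbc') \
    .option('url', url) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('user', user) \
    .option('password', password) \
    .option('dbtable', tablename) \
    .option('partitionColumn', timestamp_column) \
    .option('numPartitions', 3) \
    .option('lowerBound', lower) \
    .option('upperBound', upper) \
    .option('fetchsize', 5000) \
    .load()

Note that lowerBound & upperBound only control how the partition stride is computed; they do not filter rows, so records timestamped after midnight of the upper-bound day still land in the last partition.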
Upvotes: 3
Reputation: 5131
The question is: what datatype do you use in the SQL table?
One remark: DATETIME2 is more accurate than DATETIME, which is limited to a precision of about 3 ms, causing some queries to be interpreted incorrectly. You can check the column's actual datatype as shown below.
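A quick sketch of that check run from the Spark side (INFORMATION_SCHEMA.COLUMNS is standard SQL Server metadata; the table and column names are the placeholders from the question):

typeDF = spark.read.format('jdbc') \
    .option('url', url) \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('user', user) \
    .option('password', password) \
    .option('dbtable',
            "(select DATA_TYPE from INFORMATION_SCHEMA.COLUMNS "
            "where TABLE_NAME = 'tablename' "
            "and COLUMN_NAME = 'updated_datetime') as t") \
    .load()

# Prints 'datetime', 'datetime2', etc.
typeDF.show()

If it returns datetime, keep the ~3 ms rounding in mind whenever the generated partition predicates are compared against the stored values.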
Upvotes: 1