spark error in column type

Question

I have a dataframe column,called 'SupplierId' ,typed as a string, with a lot of digits, but also some characters chain. (ex: ['123','456','789',......,'abc']). I formatted this column as a string using

from pyspark.sql.types import StringType
df=df.withColumn('SupplierId',df['SupplierId'].cast(StringType())

So I check it is treated as a string using:

df.printSchema()

and I get:

root
 |-- SupplierId: string (nullable = true)

But when I try to convert to Pandas, or just to use df.collect(), I obtain the following error:

An error occurred while calling o516.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure:

Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, servername.ops.somecompany.local, executor 3): 
ava.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
Exception parsing 'CPD160001' into a IntegerType$ for column "SupplierId":
Unable to deserialize value using com.somecompany.spark.parsers.text.converters.IntegerConverter.
The value being deserialized was: CPD160001

So it seems Spark treats the value of this column as integers. I have tried using UDF to force convert to string with python, but it still doesn't work. Do you have any idea what could cause this error?

spark error in column type

Answers (1)

Related Questions