Paul Ogier
Paul Ogier

Reputation: 157

spark error in column type

I have a dataframe column,called 'SupplierId' ,typed as a string, with a lot of digits, but also some characters chain. (ex: ['123','456','789',......,'abc']). I formatted this column as a string using

from pyspark.sql.types import StringType
df=df.withColumn('SupplierId',df['SupplierId'].cast(StringType())

So I check it is treated as a string using:

df.printSchema()

and I get:

root
 |-- SupplierId: string (nullable = true)

But when I try to convert to Pandas, or just to use df.collect(), I obtain the following error:

An error occurred while calling o516.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure:

Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, servername.ops.somecompany.local, executor 3): 
ava.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
Exception parsing 'CPD160001' into a IntegerType$ for column "SupplierId":
Unable to deserialize value using com.somecompany.spark.parsers.text.converters.IntegerConverter.
The value being deserialized was: CPD160001 

So it seems Spark treats the value of this column as integers. I have tried using UDF to force convert to string with python, but it still doesn't work. Do you have any idea what could cause this error?

Upvotes: 3

Views: 1094

Answers (1)

desertnaut
desertnaut

Reputation: 60321

Please do share a sample of your actual data, as your issue cannot be reproduced with toy ones:

spark.version
#  u'2.2.0'

from pyspark.sql import Row
df = spark.createDataFrame([Row(1, 2, '3'),
                            Row(4, 5, 'a'),
                            Row(7, 8, '9')],
                            ['x1', 'x2', 'id'])  

df.printSchema()
# root
#  |-- x1: long (nullable = true)
#  |-- x2: long (nullable = true) 
#  |-- id: string (nullable = true) 

df.collect()
# [Row(x1=1, x2=2, id=u'3'), Row(x1=4, x2=5, id=u'a'), Row(x1=7, x2=8, id=u'9')]

import pandas as pd
df_pandas = df.toPandas()
df_pandas
#   x1 x2 id 
# 0  1  2  3
# 1  4  5  a
# 2  7  8  9

Upvotes: 1

Related Questions