Plastic Soul

Reputation: 494

Save dataframe as Parquet not working in Pyspark

I used Spark SQL with Pyspark to create a dataframe df from a table on a SQL Server.
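For reference, the load looks roughly like this (Spark 1.3-style API; the server, database, and table names below are placeholders):

    from pyspark.sql import SQLContext

    # sc is the SparkContext provided by the pyspark shell
    sqlContext = SQLContext(sc)

    # Load the table from SQL Server over JDBC
    df = sqlContext.load(
        source="jdbc",
        url="jdbc:sqlserver://myserver:1433;databaseName=mydb",
        dbtable="mytable",
    )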

df.printSchema()
root
 |-- DATE1: date (nullable = true)
 |-- ID: decimal (nullable = false)
 |-- CODE: string (nullable = true)
 |-- DATE2: timestamp (nullable = true)

which is correct, and

type(df)
<class 'pyspark.sql.dataframe.DataFrame'>

which also looks good.

Now I'd like to save the dataframe as a Parquet file, which should be straightforward:

df.save("test.parquet")

Instead, this fails with an Unsupported datatype DecimalType() error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user1/spark/python/pyspark/sql/dataframe.py", line 209, in save
    self._jdf.save(source, jmode, joptions)
  File "/home/user1/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/user1/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.save.
: java.lang.RuntimeException: Unsupported datatype DecimalType()
    at scala.sys.package$.error(package.scala:27)
    at ... 

I found this, but it doesn't describe what I'm dealing with. This table just has run-of-the-mill decimal numbers. Does anyone know what's happening? Thanks.

Upvotes: 2

Views: 4387

Answers (1)

Ken Geis

Reputation: 922

I believe the link you found is correct; this is SPARK-4176, which is fixed in Spark 1.4.0.

Your ID field is probably defined as a very wide decimal. In Oracle, if you do not specify the scale and precision, you are given a 38-digit decimal. This leads to the same error you are seeing in your example.
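As a workaround until you can upgrade, you could cast the column to a bounded type before saving. A sketch (untested against your table; "ID" is the column from your schema, and a plain "double" cast is an alternative if some precision loss is acceptable):

    from pyspark.sql.types import DecimalType

    # Cast ID from the unlimited-precision decimal to a bounded one
    # (precision <= 18), which the Parquet writer should accept
    df2 = df.withColumn("ID", df["ID"].cast(DecimalType(18, 0)))
    df2.save("test.parquet")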

Update: It turns out that when Spark SQL loads a table, it discards the precision info on the decimal fields from the database. The decimal fields are treated as unlimited precision, which triggers SPARK-4176. The symptom should go away in Spark 1.4, but I'll try to get a JIRA together about the underlying cause.
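You can see this by inspecting the schema; each decimal field comes back as DecimalType() with no precision or scale:

    # JDBC-loaded decimals print as DecimalType() with no
    # precision/scale, i.e. unlimited precision
    for field in df.schema.fields:
        print("%s: %s" % (field.name, field.dataType))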

Update: Created issue SPARK-7196.

Upvotes: 3
