Reputation: 1333
I'm trying to convert a Pandas dataframe to a Pyspark dataframe, and getting the following pyarrow-related error:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

# In a notebook/shell the session already exists; standalone it must be created.
spark = SparkSession.builder.getOrCreate()

data = np.random.rand(1000000, 10)
pdf = pd.DataFrame(data, columns=list("abcdefghij"))
df = spark.createDataFrame(pdf)
/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
I've tried different versions of pyarrow (0.10.0, 0.14.1, 0.15.1 and more) but with the same result. How can I debug this?
Upvotes: 2
Views: 1179
Reputation: 3199
Can you try upgrading your pyspark to >= 3.0.0? I had the above error with every version of arrow I tried, but bumping to the newer pyspark fixed it for me.
There is a version conflict between older versions of Spark (e.g. 2.4.x) and newer versions of arrow.
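As a quick sanity check before upgrading, here is a minimal sketch that prints the installed versions and applies a rough compatibility rule. The `arrow_pair_ok` helper is hypothetical, and the 0.15 cutoff is an assumption based on pyarrow 0.15.0's change to the Arrow IPC stream format, which Spark 2.4.x predates:

```python
# Sketch: report installed pyspark/pyarrow versions and flag the pairing.
# Assumption: Spark >= 3.0 handles modern pyarrow; Spark 2.4.x needs
# pyarrow < 0.15 (the IPC stream format changed in pyarrow 0.15.0).
from importlib.metadata import version, PackageNotFoundError


def arrow_pair_ok(pyspark_ver: str, pyarrow_ver: str) -> bool:
    """Rough rule of thumb, not an official compatibility matrix."""
    spark_major = int(pyspark_ver.split(".")[0])
    arrow_minor = tuple(int(p) for p in pyarrow_ver.split(".")[:2])
    if spark_major >= 3:
        return True
    return arrow_minor < (0, 15)


if __name__ == "__main__":
    for pkg in ("pyspark", "pyarrow"):
        try:
            print(pkg, version(pkg))
        except PackageNotFoundError:
            print(pkg, "not installed")
```

If the rule flags your pairing, either upgrade pyspark as suggested above or downgrade pyarrow below 0.15.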
Upvotes: 1
Reputation: 11
I had the same issue; switching the cluster to emr-5.30.1 and pinning the arrow version to 0.14.1 resolved it.
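If pinning pyarrow isn't practical, Spark 2.3.x/2.4.x also documents a compatibility environment variable that makes pyarrow >= 0.15.0 emit the legacy IPC stream format. A minimal sketch; note the variable must be visible to the executors as well (e.g. set in conf/spark-env.sh on the cluster), not only in the driver process:

```python
import os

# Ask pyarrow >= 0.15 to write the pre-0.15 Arrow IPC stream format
# that Spark 2.3.x/2.4.x expects. Set this before the Spark job runs,
# on both the driver and the executors.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```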
Upvotes: 1