OronNavon

Reputation: 1333

pyspark with pandas and pyarrow error on AWS EMR: 'JavaPackage' object is not callable

I'm trying to convert a Pandas dataframe to a Pyspark dataframe, and getting the following pyarrow-related error:

import pandas as pd
import numpy as np

data = np.random.rand(1000000, 10)
pdf = pd.DataFrame(data, columns=list("abcdefghij"))
df = spark.createDataFrame(pdf)

/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.

I've tried different versions of pyarrow (0.10.0, 0.14.1, 0.15.1, and others), but with the same result. How can I debug this?

Upvotes: 2

Views: 1179

Answers (2)

K.S.

Reputation: 3199

Can you try upgrading your pyspark to >= 3.0.0? I had the above error with all versions of arrow, but bumping to the newer pyspark fixed it for me.

There is a version conflict between older versions of Spark (e.g. 2.4.x) and newer versions of arrow.
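The conflict described above can be sketched as a quick pre-flight check. This is a rough illustration, not an official API: the helper name is made up, and the thresholds are assumptions based on this answer (Spark < 3.0 generally needs pyarrow < 0.15.0 for the Arrow optimization to work).

```python
def arrow_compatible(spark_version: str, pyarrow_version: str) -> bool:
    """Rough sketch: is this Spark/pyarrow pair expected to support
    Arrow-optimized createDataFrame? Thresholds are assumptions based
    on the Spark 2.4.x vs pyarrow >= 0.15 conflict described above."""
    def major_minor(v):
        # "2.4.4" -> (2, 4); ignore patch and any suffix
        return tuple(int(p) for p in v.split(".")[:2])

    if major_minor(spark_version) >= (3, 0):
        return True  # newer pyspark resolved the conflict
    return major_minor(pyarrow_version) < (0, 15)

print(arrow_compatible("2.4.4", "0.15.1"))  # False: the conflict in question
print(arrow_compatible("2.4.4", "0.14.1"))  # True
print(arrow_compatible("3.0.0", "0.15.1"))  # True
```

You could run this with your cluster's actual `pyspark.__version__` and `pyarrow.__version__` to see whether a version mismatch is the likely culprit.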

Upvotes: 1

flyingqubit

Reputation: 11

I had the same issue; changing the cluster to the emr-5.30.1 release and the arrow version to 0.14.1 resolved it.
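If downgrading pyarrow below 0.15 is not an option, the Spark 2.4 documentation describes a compatibility setting for running pyarrow >= 0.15.0 against Spark 2.3.x/2.4.x: setting `ARROW_PRE_0_15_IPC_FORMAT=1` so pyarrow writes the legacy IPC format Spark 2.x expects. Whether it cures this exact 'JavaPackage' error may depend on your setup; a minimal sketch of the setting in Python:

```python
import os

# Compatibility setting from the Spark 2.4 docs ("Compatibility Setting
# for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x"): force pyarrow to emit
# the pre-0.15 Arrow IPC format that Spark 2.x understands.
# On a real cluster this belongs in conf/spark-env.sh (or an EMR
# configuration) on every node, so executors inherit it too -- setting it
# only on the driver, as here, is not enough by itself.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```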

Upvotes: 1
