Reputation: 141
When trying to convert a PySpark DataFrame to a pandas DataFrame with the Arrow optimization enabled, only a fraction of the rows are converted. The PySpark DataFrame contains 170,256 rows.
>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()
>> result_pdf comes back with only a fraction of the rows.
I tried to install and update pyarrow using the following commands:
>> conda install -c conda-forge pyarrow
>> pip install pyarrow
>> pip install pyspark[sql]
and then ran:
>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()
I get the following warnings each time during the conversion:
C:\Users\MUM1342\.conda\envs\snakes\lib\site-packages\pyarrow\__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
C:\Users\MUM1342\.conda\envs\snakes\lib\site-packages\pyspark\sql\dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
Actual Output:
> train_set.count()
> 170256
> result_pdf.shape
> 6500
Expected Output:
> train_set.count()
> 170256
> result_pdf.shape
> 170256
Upvotes: 2
Views: 3508
Reputation: 63
Please try the below and see if it works:
>> # Enable Arrow-based columnar data transfers
>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
Upvotes: 1