Nikita Rathi

Reputation: 141

PySpark DataFrame not returning all rows when converting to pandas using toPandas with PyArrow enabled in PySpark

When trying to convert a PySpark DataFrame into a pandas DataFrame with the Arrow optimization enabled, only half of the rows are converted. The PySpark DataFrame contains 170,000 rows.

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()
result_pdf contains only 65,000 rows.

I tried to install and update pyarrow using the following commands:

>> conda install -c conda-forge pyarrow
>> pip install pyarrow
>> pip install pyspark[sql]

and then ran:

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()

I get the following warning messages each time during conversion:

C:\Users\MUM1342\.conda\envs\snakes\lib\site-packages\pyarrow\__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
C:\Users\MUM1342\.conda\envs\snakes\lib\site-packages\pyspark\sql\dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.
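
The pyarrow.open_stream deprecation warning usually points to an installed PyArrow that is newer than the one the installed PySpark release was written against. A minimal sketch for checking the installed versions, assuming both packages import from the same environment as the Spark driver:

import pyarrow
import pyspark

# Print both versions to see whether this is a combination PySpark supports.
print("pyarrow:", pyarrow.__version__)
print("pyspark:", pyspark.__version__)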

Actual Output:

> train_set.count()
> 170256
> result_pdf.shape
> 6500

Expected Output:

> train_set.count()
> 170256
> result_pdf.shape
> 170256

Upvotes: 2

Views: 3508

Answers (1)

Amit Ghosh

Reputation: 63

Please try the following and see if it works.

Enable Arrow-based columnar data transfers

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
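
A minimal sketch of that suggestion, assuming the SparkSession spark and the train_set DataFrame from the question; the fallback setting is an addition here, not part of the original answer, and on Spark 3.x the config keys are spark.sql.execution.arrow.pyspark.enabled and spark.sql.execution.arrow.pyspark.fallback.enabled instead:

# Enable Arrow-based columnar data transfers; also allow Spark to fall back
# to the non-Arrow path when the Arrow optimization cannot be used at all
# (as the warning notes, this does not cover failures mid-computation).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

result_pdf = train_set.select("*").toPandas()

# Compare row counts to confirm no rows were dropped in the conversion.
print(train_set.count(), result_pdf.shape[0])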

Upvotes: 1
