Sparrow Jack

Reputation: 45

PySpark in a Jupyter Notebook: why can't a DataFrame of two rows be converted to a pandas DataFrame?

This is the PySpark DataFrame, along with its schema. It has just two rows.

[screenshot: DataFrame contents and schema]

Then I want to convert it to a pandas DataFrame:

[screenshot: the conversion call]

But it hangs at stage 3, with no result and no information about progress. Why can this happen?

[screenshot: the job stuck at stage 3]

And when I use pandas_api, the result is the same.

[screenshot: the pandas_api() call]
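For reference, the conversion calls shown in the screenshots look roughly like this (a sketch; df is the name I use here for the two-row DataFrame, which is only visible in the images):

    # Both of these hang at stage 3 with no progress information:
    pdf = df.toPandas()      # plain pandas conversion
    psdf = df.pandas_api()   # pandas-on-Spark API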

Why could this happen? It has bothered me the whole day.

Could anyone help me?

These are the package versions:

[screenshot: installed package versions]

Upvotes: 0

Views: 39

Answers (2)

Sparrow Jack

Reputation: 45

After trying again and again, I found the cause: although Spark runs in local mode, the source directory contains several parquet files, so the DataFrame has multiple partitions. I needed to convert it to an RDD, coalesce it into one partition, and then convert that RDD back into a PySpark DataFrame. After that, pandas_api works fine. I hope this answer helps someone who runs into the same problem.

[screenshot: the working code]
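A sketch of that workaround, assuming the DataFrame is named df and the session spark (the actual code is only shown in the screenshot):

    # The source directory holds several parquet files, so df has several
    # partitions even in local mode. Collapse them into one via the RDD,
    # then rebuild a DataFrame before converting.
    rdd = df.rdd.coalesce(1)
    df_one = spark.createDataFrame(rdd, schema=df.schema)
    psdf = df_one.pandas_api()  # now completes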

Upvotes: 0

Akhaya Chandan Mishra

Reputation: 72

Try putting this in the first cell of the notebook:

    import findspark

    findspark.init()   # locate Spark and add pyspark to sys.path
    findspark.find()   # return the SPARK_HOME that was found

This will initialize Spark in the Jupyter notebook.
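After that, a SparkSession can be created as usual (a minimal sketch; the app name here is arbitrary):

    from pyspark.sql import SparkSession

    # Create (or reuse) a local session; "notebook" is an arbitrary app name.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("notebook")
             .getOrCreate())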


Upvotes: 0
