Sparrow Jack

Reputation: 45

PySpark in a Jupyter Notebook: why can't a DataFrame of two rows be converted to a pandas DataFrame?

This is the PySpark DataFrame, along with its schema. It has just two rows.

[screenshot: DataFrame contents and schema]

Then I want to convert it to a pandas DataFrame:

[screenshot: the conversion call]

But it hangs at stage 3, with no result and no information about progress. Why can this happen?

[screenshot: the job stuck at stage 3]

And when I use pandas_api, the result is the same.

[screenshot: the pandas_api() call]
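For reference, the conversion calls shown in the screenshots look roughly like this (a sketch; df is the name I use here for the two-row DataFrame, which is only visible in the images):

    # Both of these hang at stage 3 with no progress information:
    pdf = df.toPandas()      # plain pandas conversion
    psdf = df.pandas_api()   # pandas-on-Spark API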

Why could this happen? It has bothered me the whole day.

Could anyone help me?

These are the package versions:

[screenshot: installed package versions]

Upvotes: 0

Views: 39

Answers (2)

Sparrow Jack

Reputation: 45

After trying again and again, I found the cause: although Spark runs in local mode, the source directory contains several parquet files, so the DataFrame has multiple partitions. I needed to convert it to an RDD, coalesce it into one partition, and then convert that RDD back into a PySpark DataFrame. After that, pandas_api works fine. I hope this answer helps someone who runs into the same problem.

[screenshot: the working code]
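A sketch of that workaround, assuming the DataFrame is named df and the session spark (the actual code is only shown in the screenshot):

    # The source directory holds several parquet files, so df has several
    # partitions even in local mode. Collapse them into one via the RDD,
    # then rebuild a DataFrame before converting.
    rdd = df.rdd.coalesce(1)
    df_one = spark.createDataFrame(rdd, schema=df.schema)
    psdf = df_one.pandas_api()  # now completes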

Upvotes: 0

Akhaya Chandan Mishra

Reputation: 72

Try putting this in the first cell of the notebook:

    import findspark

    findspark.init()   # locate Spark and add pyspark to sys.path
    findspark.find()   # return the SPARK_HOME that was found

This will initialize Spark in the Jupyter notebook.
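After that, a SparkSession can be created as usual (a minimal sketch; the app name here is arbitrary):

    from pyspark.sql import SparkSession

    # Create (or reuse) a local session; "notebook" is an arbitrary app name.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("notebook")
             .getOrCreate())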


Upvotes: 0
