Reputation: 45
This is the PySpark DataFrame.
And this is its schema; the DataFrame has just two rows.
I then want to convert it to a pandas DataFrame.
But the job is suspended at stage 3: no result, and no information about its progress. Why can this happen?
And when I use pandas_api instead, the result is the same.
Why could this happen? It has been bothering me the whole day.
Could anyone help me?
This is the package version.
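Roughly what I am running, as a minimal sketch (the source path here is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("/path/to/source_dir")
df.printSchema()

pdf = df.toPandas()        # hangs at stage 3 with no output
# psdf = df.pandas_api()   # same behavior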
Upvotes: 0
Views: 39
Reputation: 45
After trying again and again, I found the cause: although Spark runs in local mode, the source directory contains several parquet files, so the DataFrame is split across several partitions. I needed to convert the DataFrame to an RDD, coalesce it into one partition, and then convert the RDD back into a PySpark DataFrame. After that, pandas_api works fine. I hope this answer helps someone who runs into the same problem.
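A minimal sketch of the workaround described above (the source path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# the source directory holds several parquet files, so the DataFrame
# starts out with multiple partitions
df = spark.read.parquet("/path/to/source_dir")

# collapse everything into one partition via the RDD, then rebuild
# the DataFrame with the original schema
single_part = spark.createDataFrame(df.rdd.coalesce(1), schema=df.schema)

psdf = single_part.pandas_api()   # completes instead of hanging

Note that calling coalesce(1) on the DataFrame itself (df.coalesce(1)) should give the same single-partition layout without the round trip through the RDD.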
Upvotes: 0
Reputation: 72
Try using this in the first cell of the notebook:
import findspark
findspark.init()   # locate the Spark installation and add pyspark to sys.path
findspark.find()   # return the path to the Spark installation that was found
This will initialize Spark in the Jupyter notebook.
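After findspark.init(), a SparkSession can be created as usual; a minimal sketch (the app name is arbitrary):

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("notebook").getOrCreate()
print(spark.version)   # confirms the session is up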
Upvotes: 0