alfonsomore

Reputation: 27

PySpark DataFrame .show() fails

!pip install pyspark
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
pdf = pd.read_excel("xxxx.xlsx", sheet_name='Input (I)')
df = spark.createDataFrame(pdf)
df.show()

But I get an error:

Py4JJavaError: An error occurred while calling o41.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (10.75.81.111 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.

Upvotes: 0

Views: 378

Answers (1)

Luiz Viola

Reputation: 2416

This seems to be related to the communication between PySpark and the Python worker, which might be solved by setting an environment variable:

Set Env PYSPARK_PYTHON=python
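As a sketch, the same variable can also be set from Python itself, before the SparkSession is created, so that the workers use the same interpreter as the driver (assuming that interpreter is the one with pandas/pyspark installed):

```python
import os
import sys

# Point PySpark's worker processes at the interpreter running the driver.
# These must be set BEFORE SparkSession.builder.getOrCreate() is called.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

After setting these, restart the kernel (or at least recreate the SparkSession) so the new values take effect.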

But, why don't you load the xlsx file directly on a PySpark DF? Something like:

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Input (I)") \
    .load("xxxx.xlsx")
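Note that this reader is not bundled with Spark; the spark-excel package has to be on the classpath. One way (version shown is only an example; pick the artifact matching your Scala/Spark build from Maven Central):

```shell
# Launch PySpark with the crealytics spark-excel package (example coordinates)
pyspark --packages com.crealytics:spark-excel_2.12:0.13.7
```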

Upvotes: 1
