Neil

Reputation: 211

Spark read.json can't find file

Hey all, I have a standalone Spark cluster on AWS with 1 master and 1 slave node. There is a folder in my home directory called ~/Notebooks. This is where I launch Jupyter notebooks and connect to Jupyter in my browser. I also have a file in there called people.json (a simple JSON file).

I try running this code:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('Practice').setMaster('spark://ip-172-31-2-186:7077')
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)

df = sqlContext.read.json("people.json")

I get this error when I run that last line. I don't get it — the file is right there. Any ideas?

Py4JJavaError: An error occurred while calling o238.json. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 37, ip-172-31-7-160.us-west-2.compute.internal): java.io.FileNotFoundException: File file:/home/ubuntu/Notebooks/people.json does not exist

Upvotes: 0

Views: 1945

Answers (1)

Aravind Yarram

Reputation: 80194

Make sure the file is available on the worker nodes. A relative path like "people.json" is resolved on each executor, not just on the driver, so every node must be able to see the file at the same path. The best way is to use a shared file system (NFS, HDFS). See the External Datasets documentation.
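For example, either point the read at a shared store or copy the file to the same absolute path on every node. A minimal sketch (the HDFS namenode host/port and paths here are assumptions — substitute your own):

```python
# Option 1: read from a shared filesystem (HDFS) that all nodes can reach.
# The namenode host and port below are hypothetical.
df = sqlContext.read.json("hdfs://ip-172-31-2-186:9000/data/people.json")

# Option 2: if you must use local files, first copy people.json to the
# identical absolute path on every worker (e.g. with scp), then use an
# explicit file:// URI so the path is unambiguous on each node:
df = sqlContext.read.json("file:///home/ubuntu/Notebooks/people.json")
```

With option 2, the read still fails with FileNotFoundException on any worker that is missing its copy, which is why a shared filesystem is the more robust choice.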

Upvotes: 1
